planet.freedesktop.org
 July 23, 2014
I've just pushed the vc4-sim-validate branch to my Mesa tree. It's the culmination of the last week's worth of pondering and false starts since I got my first texture sampling in simulation last Wednesday.

Handling texturing on vc4 safely is a pain. The pointer to texture contents doesn't appear in the normal command stream, and instead it's in the uniform stream. Which uniform happens to contain the pointer depends on how many uniforms have been loaded by the time you get to the QPU_W_TMU[01]_[STRB] writes. Since there's no iommu, I can't trust userspace to tell me where the uniform is, otherwise I'd be allowing them to just lie and put in physical addresses and read arbitrary system memory.

This meant I had to write a shader parser for the kernel, have that spit out a collection of references to texture samples, switch the uniform data in the user -> kernel ABI from living in BOs to being passed in as normal system memory that gets copied to the temporary exec bo, and then do relocations on that.
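The relocation step described above can be sketched roughly as follows. This is only an illustration, not the actual vc4 kernel code: `struct ex_tex_sample`, `struct ex_validated_bo` and `ex_relocate_uniforms` are hypothetical names, and I'm assuming userspace stores a BO handle (an index into a validated table) in each uniform slot that the shader parser flagged as a texture pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record emitted by the shader parser: the uniform slot that
 * the shader will consume as a texture pointer. */
struct ex_tex_sample {
	uint32_t uniform_index;
};

/* Hypothetical validated BO: an address the kernel trusts. */
struct ex_validated_bo {
	uint32_t paddr;
	uint32_t size;	/* bounds checking against this is not done here */
};

/*
 * Copy the userspace uniforms into the exec BO, then overwrite every
 * texture slot with the address the kernel looked up itself, so userspace
 * never gets to supply a raw physical address. Returns 0 or -1.
 */
static int ex_relocate_uniforms(uint32_t *exec_uniforms,
				const uint32_t *user_uniforms, size_t count,
				const struct ex_tex_sample *samples,
				size_t nsamples,
				const struct ex_validated_bo *bos, size_t nbos)
{
	size_t i;

	memcpy(exec_uniforms, user_uniforms, count * sizeof(*exec_uniforms));

	for (i = 0; i < nsamples; i++) {
		uint32_t slot = samples[i].uniform_index;
		uint32_t handle;

		if (slot >= count)
			return -1;	/* shader reads past the uniform data */

		handle = exec_uniforms[slot];
		if (handle >= nbos)
			return -1;	/* unknown BO handle */

		exec_uniforms[slot] = bos[handle].paddr;
	}
	return 0;
}
```

The point of the design is that the only thing userspace controls in a texture slot is a handle, which is either valid or rejected.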

Instead of trying to write this in the kernel, with a ~10 minute turnaround time per test run, I copied my kernel code into Mesa with a little bit of wrapper code to give a kernel-like API environment, and did my development on that. When I'm looking at possibly 100s of iterations to get all the validation code working, it was well worth the day spent to build that infrastructure so that I could get my testing turnaround time down to about 15 sec.

I haven't done actual validation to make sure that the texture samples don't access outside of the bounds of the texture yet (though I at least have the infrastructure necessary now), just like I haven't done that validation for so many other pointers (vertex fetch, tile load/stores, etc.). I also need to copy the code back out to the kernel driver, and it really deserves some cleanups to add sanity to the many different addresses involved (unvalidated vaddr, validated vaddr, and validated paddr of the data for each of render, bin, shader recs, uniforms). But hopefully once I do that, I can soon start bringing up glamor on the Pi (though I've got some major issue with tile allocation BO memory management before anything's stable on the Pi).
 July 22, 2014

# Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).

# Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something that should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.
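The 8MB versus 8K figures can be checked with a few lines of arithmetic. This is a sketch under the assumptions used elsewhere in this post: 4K pages, 512 entries per table, and four page directories in legacy 32b mode. The function and macro names are made up for the example.

```c
#include <stdint.h>

#define EX_PAGE_SIZE		4096u
#define EX_ENTRIES_PER_TABLE	512u	/* 8-byte entries in a 4K page */
#define EX_LEGACY_PDS		4u	/* one page directory per PDP register */

/* Bytes of page tables (PTs only) when every PT is allocated up front. */
static uint64_t ex_legacy_full_pt_bytes(void)
{
	return (uint64_t)EX_LEGACY_PDS * EX_ENTRIES_PER_TABLE * EX_PAGE_SIZE;
}

/* Bytes needed to map a single page on demand: one PD plus one PT. */
static uint64_t ex_legacy_dynamic_one_page_bytes(void)
{
	return 2u * EX_PAGE_SIZE;
}
```

The up-front number ignores the 16K for the four page directories themselves, which is why the text says "about" 8MB.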

## Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

| Feature | Status | TODO |
|---|---|---|
| Dynamic page tables | Implemented | Test and fix bugs |
| 64b Address space | Implemented | Test and fix bugs |
| GPU mirroring | Proof of Concept | Decide on interface; Implement interface. [1] |

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

## Present: Relocations

Throughout many of my previous blog posts I’ve gone out of my way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

[Figure: Current PPGTT support]

To get to the above state, something like the following would happen.

1. Create BOx
2. Create BOy
3. Request BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING ioctl.
4. Do one of aforementioned operations on BOx and BOy
5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO [2]. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Ultimately, all the GPU really cares about is physical pages, and which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which need an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.
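To make "process the list of GPU commands needing update" a bit more concrete, here is a heavily simplified sketch of what applying relocations amounts to. It is not the i915 implementation: `struct ex_reloc` and `ex_apply_relocs` are hypothetical, the entry layout is only loosely modelled on the real relocation entries, and I'm assuming 32b graphics addresses written straight into the batch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical relocation entry. */
struct ex_reloc {
	uint32_t target;	/* index of the BO whose address is needed */
	uint32_t delta;		/* offset to add to that BO's address */
	uint64_t offset;	/* byte offset in the batch to patch */
};

/*
 * For every entry, write (GPU address of target BO + delta) into the batch
 * at the given offset. gpu_addr[] holds the addresses the kernel picked
 * when it mapped each BO. Returns 0 on success, -1 on a bad entry.
 */
static int ex_apply_relocs(uint8_t *batch, size_t batch_size,
			   const struct ex_reloc *relocs, size_t nrelocs,
			   const uint64_t *gpu_addr, size_t nbos)
{
	size_t i;

	for (i = 0; i < nrelocs; i++) {
		const struct ex_reloc *r = &relocs[i];
		uint32_t value;

		if (r->target >= nbos)
			return -1;
		if (batch_size < sizeof(value) ||
		    r->offset > batch_size - sizeof(value))
			return -1;

		value = (uint32_t)(gpu_addr[r->target] + r->delta);
		memcpy(batch + r->offset, &value, sizeof(value));
	}
	return 0;
}
```

Every one of those writes is CPU work done at submit time, which is what goes away if userspace already knows the final address.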

## Future: No relocations

[Figure: GPU Mirroring]

The diagram above demonstrates the goal. Symmetric mappings to a BO on both the GPU and the CPU. There are benefits for ditching relocations. One of the nice side effects of getting rid of relocations is it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and
kernels executing on devices to directly share complex, pointer-containing data structures such
as trees and linked lists. It also eliminates the need to marshal data between the host and devices.
As a result, SVM substantially simplifies OpenCL programming and may improve performance."


# Makin’ it Happen

## 64b

As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

[Figure: Page Table Hierarchy]

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space [3]. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.
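For anyone in the ‘WTF’ camp, the hierarchy boils down to slicing the address into four table indices. A minimal sketch, assuming 4K pages and 512 entries (9 bits of index) per level as on x86-64; the struct and function names are invented for this example:

```c
#include <stdint.h>

#define EX_PAGE_SHIFT	12			/* 4K pages */
#define EX_INDEX_BITS	9			/* 512 entries per table */
#define EX_INDEX_MASK	((1u << EX_INDEX_BITS) - 1)

struct ex_pt_indices {
	unsigned int pml4e;	/* which PDP the PML4 points to */
	unsigned int pdpe;	/* which PD within that PDP */
	unsigned int pde;	/* which PT within that PD */
	unsigned int pte;	/* which page within that PT */
};

/* Split a GPU virtual address into its four table indices. */
static struct ex_pt_indices ex_split_address(uint64_t addr)
{
	struct ex_pt_indices idx;

	idx.pte   = (addr >> EX_PAGE_SHIFT) & EX_INDEX_MASK;
	idx.pde   = (addr >> (EX_PAGE_SHIFT + EX_INDEX_BITS)) & EX_INDEX_MASK;
	idx.pdpe  = (addr >> (EX_PAGE_SHIFT + 2 * EX_INDEX_BITS)) & EX_INDEX_MASK;
	idx.pml4e = (addr >> (EX_PAGE_SHIFT + 3 * EX_INDEX_BITS)) & EX_INDEX_MASK;
	return idx;
}
```

With only 256 PML4 entries in use, the top index never exceeds 255, so the usable address is 47 bits.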

## “This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have is “done”, then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

1 page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops


### Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

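    typedef unsigned long   pteval_t;
    typedef struct { pteval_t pte; } pte_t;

    static inline pteval_t native_pte_val(pte_t pte)
    {
            return pte.pte;
    }

    static inline pteval_t pte_flags(pte_t pte)
    {
            return native_pte_val(pte) & PTE_FLAGS_MASK;
    }

    static inline int pte_present(pte_t a)
    {
            return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                                   _PAGE_NUMA);
    }

    static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
    {
            return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
    }

    static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
    {
            return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    /* My completely fabricated example of finding page presence */
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *ptep;
    struct mm_struct *mm = current->mm;
    unsigned long address = 0xdefeca7e; /* arbitrary address to look up */

    /* Walk one level at a time: pgd -> pud -> pmd -> pte */
    pgd = pgd_offset(mm, address);
    pud = pud_offset(pgd, address);
    pmd = pmd_offset(pud, address);
    ptep = pte_offset_map(pmd, address);
    printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");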


x86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

1. The kernel knows exactly where in physical memory the page tables reside [4]. On x86, it need only read CR3. We don’t know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

### Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

• PML4[0-255], each pointing to a PDP
• PDP[0-255][0-511], each pointing to a PD
• PD[0-255][0-511][0-511], each pointing to a PT
• PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)

1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer.
2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0 is free.
    2. [i915] Allocate PDP for PML4[0]
    3. [i915] Allocate PD for PDP[0][0]
    4. [i915] Allocate PT for PD[0][0][0]
    5. [i915] (condensed) Set pointers from PML4->PDP->PD->PT
    6. [i915] Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing page.
3. [i915] Dispatch work to the GPU on behalf of mesa.
4. [i915] Observe the hardware has completed.
5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0x200000 is free.
    2. [i915] Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

### Page Tables Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries: point every unused page table entry at the scratch page. Implementing this involves reading back potentially large amounts of data from the page tables, which will be slow. It should work though. I didn’t try it.
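For the curious, the sentinel idea would look something like the sketch below. Since I never tried it, treat this purely as an illustration: the function name is made up, and it assumes a table is 512 64-bit entries with unused entries holding the scratch page's entry value.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_ENTRIES_PER_TABLE 512

/*
 * Count the entries in one table that point somewhere other than the
 * scratch page. A table with zero used entries could be torn down.
 * Note that this has to read back the whole table to answer the question.
 */
static size_t ex_count_used_entries(const uint64_t *table,
				    uint64_t scratch_entry)
{
	size_t i, used = 0;

	for (i = 0; i < EX_ENTRIES_PER_TABLE; i++)
		if (table[i] != scratch_entry)
			used++;
	return used;
}
```

Answering "is anything in here allocated?" costs a 4K read per table, where a bitmap answers it from a few words of CPU memory.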

After I had determined I couldn’t reuse x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went: none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point, notice the LaTeX!
$\frac{2^{47}\ \text{bytes}}{4096\ \text{bytes/page}} = 34359738368\ \text{pages} \\ 34359738368\ \text{pages} \times \frac{1\ \text{bit}}{1\ \text{page}} = 34359738368\ \text{bits} \\ 34359738368\ \text{bits} \times \frac{1\ \text{byte}}{8\ \text{bits}} = 4294967296\ \text{bytes}$
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
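$256\ \text{entries} + (256\times512)\ \text{entries} + (256\times512^2)\ \text{entries} = 67240192\ \text{entries} \\ 67240192\ \text{entries} \times \frac{1\ \text{bit}}{1\ \text{entry}} = 67240192\ \text{bits} \\ 67240192\ \text{bits} \times \frac{1\ \text{byte}}{8\ \text{bits}} = 8405024\ \text{bytes} \\ 4294967296\ \text{bytes} + 8405024\ \text{bytes} = 4303372320\ \text{bytes} \\ 4303372320\ \text{bytes} \times \frac{1\ \text{GB}}{1073741824\ \text{bytes}} = 4.0078\ \text{GB}$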

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is about 0.003% of the memory used by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.
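If you would rather check the bitmap math with a compiler than by hand, something like this does it. The helper names are invented for the example; the constants are the ones used above (47 bits of address space, 4K pages, 256 PML4 entries, 512 entries at every other level).

```c
#include <stdint.h>

/* Bytes of bitmap needed to track `entries` items at one bit each. */
static uint64_t ex_bitmap_bytes(uint64_t entries)
{
	return (entries + 7) / 8;
}

/* Bitmap bytes to track every page plus every PDP, PD and PT. */
static uint64_t ex_total_tracking_bytes(void)
{
	uint64_t pages = (1ull << 47) / 4096;
	uint64_t tables = 256ull + 256ull * 512 + 256ull * 512 * 512;

	return ex_bitmap_bytes(pages) + ex_bitmap_bytes(tables);
}

/* Tracking overhead as a fraction of the full 128TB address space. */
static double ex_tracking_fraction(void)
{
	return (double)ex_total_tracking_bytes() / (double)(1ull << 47);
}
```

The fraction comes out at roughly 3 in 100,000, which is the "inconsequential" part.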

#### Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

    static struct i915_pagedirpo *
    alloc_one_pdp(struct i915_pml4 *pml4, int entry)
    {
            ...
    }

    static struct i915_pagedir *
    alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
    {
            ...
    }

    static struct i915_pagetab *
    alloc_one_pt(struct i915_pagedir *pd, int entry)
    {
            ...
    }

    /**
     * alloc_page_tables - Allocate all page tables for the given virtual address.
     *
     * This will allocate all the necessary page tables to map exactly one page at
     * @address. The page tables will not be connected, and the PTE will not point
     * to a page.
     *
     * @ppgtt:      The PPGTT structure encapsulating the virtual address space.
     * @address:    The virtual address for which we want page tables.
     *
     */
    static void
    alloc_page_tables(struct i915_hw_ppgtt *ppgtt, unsigned long address)
    {
            struct i915_pagetab *pt;
            struct i915_pagedir *pd;
            struct i915_pagedirpo *pdp;
            struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

            /* Assumes 4K pages and 512 entries (9 index bits) per level. */
            int pml4e = (address >> 39) & 0x1ff;
            int pdpe = (address >> 30) & 0x1ff;
            int pde = (address >> 21) & 0x1ff;
            int pte = (address >> 12) & 0x1ff;

            /* Jump to the first level that is missing; later levels follow. */
            if (!test_bit(pml4e, pml4->used_pml4es))
                    goto alloc_pdp;

            pdp = pml4->pagedirpo[pml4e];
            if (!test_bit(pdpe, pdp->used_pdpes))
                    goto alloc_pd;

            pd = pdp->pagedirs[pdpe];
            if (!test_bit(pde, pd->used_pdes))
                    goto alloc_pt;

            pt = pd->page_tables[pde];
            if (test_bit(pte, pt->used_ptes))
                    return; /* Already mapped. */

            return; /* Tables exist; only the PTE is unused, nothing to allocate. */

    alloc_pdp:
            pdp = alloc_one_pdp(pml4, pml4e);
            set_bit(pml4e, pml4->used_pml4es);
    alloc_pd:
            pd = alloc_one_pd(pdp, pdpe);
            set_bit(pdpe, pdp->used_pdpes);
    alloc_pt:
            pt = alloc_one_pt(pd, pde);
            set_bit(pde, pd->used_pdes);
    }


Here is a picture which shows the bitmaps for the 2 allocation example above.

[Figure: Bitmaps tracking page tables]
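To see the bitmaps answer the "is it allocated already?" question from the two-BO example, here is a small userspace model. It is not driver code: the `ex_` names are invented, every level uses the same hypothetical `struct ex_table`, and plain `calloc` stands in for real page table allocation. It again assumes 4K pages and 512 entries per table.

```c
#include <stdint.h>
#include <stdlib.h>

#define EX_ENTRIES	512
#define EX_LEVELS	3	/* PML4 -> PDP -> PD; a PD's children are PTs */

/* One table at any level: a usage bitmap plus pointers to child tables. */
struct ex_table {
	uint64_t used[EX_ENTRIES / 64];
	struct ex_table *child[EX_ENTRIES];
};

static int ex_test_bit(const uint64_t *map, unsigned int bit)
{
	return (map[bit / 64] >> (bit % 64)) & 1;
}

static void ex_set_bit(uint64_t *map, unsigned int bit)
{
	map[bit / 64] |= 1ull << (bit % 64);
}

/*
 * Walk from the PML4 towards the page table covering addr, allocating any
 * missing table and marking it in its parent's bitmap. Returns how many
 * tables were newly allocated, or -1 if an allocation failed.
 */
static int ex_alloc_tables(struct ex_table *pml4, uint64_t addr)
{
	struct ex_table *t = pml4;
	int level, allocated = 0;

	for (level = 0; level < EX_LEVELS; level++) {
		unsigned int shift = 39 - 9 * level;	/* 39, 30, 21 */
		unsigned int idx = (addr >> shift) & (EX_ENTRIES - 1);

		if (!ex_test_bit(t->used, idx)) {
			t->child[idx] = calloc(1, sizeof(*t->child[idx]));
			if (!t->child[idx])
				return -1;
			ex_set_bit(t->used, idx);
			allocated++;
		}
		t = t->child[idx];
	}
	return allocated;
}
```

Walking address 0 on an empty PML4 allocates three tables (PDP, PD, PT). Walking 0x200000 afterwards allocates just one new PT, because the bitmaps already show the PDP and PD as present.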

# The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

    struct drm_i915_gem_userptr {
            __u64 user_ptr;
            __u64 user_size;
            __u32 ctx_id;
            __u32 flags;
    #define I915_USERPTR_GPU_MIRROR         (1<<1)
    #define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
            /**
             * Returned handle for the object.
             *
             * Object handles are nonzero.
             */
            __u32 handle;
    };


The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.
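From userspace, filling in a mirroring request would look roughly like this. The struct below is a local stand-in copied from the proposal above rather than a real uapi header, `ex_fill_mirror_request` is a made-up helper, and the actual ioctl call is left out.

```c
#include <stdint.h>
#include <string.h>

/* Local copy of the proposed layout, for illustration only. */
struct ex_userptr {
	uint64_t user_ptr;
	uint64_t user_size;
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t handle;	/* filled in by the kernel on success */
};

#define EX_USERPTR_GPU_MIRROR	(1u << 1)

/*
 * Ask for the range [ptr, ptr + size) to appear at the same virtual
 * address in the GPU address space belonging to context ctx_id.
 */
static void ex_fill_mirror_request(struct ex_userptr *req, void *ptr,
				   uint64_t size, uint32_t ctx_id)
{
	memset(req, 0, sizeof(*req));
	req->user_ptr = (uint64_t)(uintptr_t)ptr;
	req->user_size = size;
	req->ctx_id = ctx_id;
	req->flags = EX_USERPTR_GPU_MIRROR;
}
```

Without the mirror flag the same request would just be a plain userptr object, with the driver free to pick the GPU address.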

• Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us.
• Cons:
    • You need 1 IOCTL per object. Much unneeded overhead.
    • It’s subject to a lot of problems userptr has [5]
    • Userptr was already merged, so unless pad gets repurposed, we’re screwed

## What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trends of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called “soft pin.” It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can’t be merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

# Wrapping it up (all 4 parts)

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address spaces. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that relates to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)

3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

4. I am not sure how KVM manages page tables. At least conceptually I’d think they’d have a similar problem to the i915 driver’s page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring

 July 21, 2014

## Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fall backs aren’t terrible, this isn’t as useful as it once was, and because this uses Glamor in a weird way, we’re making the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

### Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding “_uxa” to their names. A couple dozen sed runs and now a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new ‘intel_uxa.h’ file was created to hold all of the definitions.

Finally, a few non UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

### Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

    if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
            int ok = 0;

            if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
                    ok = glamor_fill_spans_nf(pDrawable,
                                              pGC, n, ppt, pwidth, fSorted);
                    uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
            }

            if (!ok)
                    goto fallback;

            return;
    }


Pulling those out shrank the UXA code by quite a bit.

### Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an ‘accel’ variable which could be one of three options:

• ACCEL_GLAMOR
• ACCEL_UXA
• ACCEL_NONE

I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3 so that we can bring up Mesa and run it under X before we have any acceleration code ready; avoiding a dependency loop when doing new hardware. All that it requires is a kernel that offers mode setting and buffer allocation.

### Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we’ve got lots of places that look like:

    switch (intel->accel) {
    #if USE_GLAMOR
    case ACCEL_GLAMOR:
            if (!intel_glamor_create_screen_resources(screen))
                    return FALSE;
            break;
    #endif
    #if USE_UXA
    case ACCEL_UXA:
            if (!intel_uxa_create_screen_resources(screen))
                    return FALSE;
            break;
    #endif
    case ACCEL_NONE:
            if (!intel_none_create_screen_resources(screen))
                    return FALSE;
            break;
    }


Using a switch means that we can easily elide code that isn’t wanted in a particular build. Of course ‘accel’ is an enum, so places which are missing one of the required paths will cause a compiler warning.
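The compiler-warning trick works with any enum, as in this small standalone sketch. The `ex_` names are invented here and are not the driver's; the relevant warning is `-Wswitch`, which GCC enables as part of `-Wall`.

```c
enum ex_accel {
	EX_ACCEL_NONE,
	EX_ACCEL_GLAMOR,
	EX_ACCEL_UXA,
};

static const char *ex_accel_name(enum ex_accel accel)
{
	switch (accel) {
	case EX_ACCEL_NONE:
		return "none";
	case EX_ACCEL_GLAMOR:
		return "glamor";
	case EX_ACCEL_UXA:
		return "uxa";
	/*
	 * Deliberately no default: label. If one of the cases above is
	 * deleted, the compiler warns that the enumeration value is not
	 * handled in the switch.
	 */
	}
	return "unknown";
}
```

Adding a `default:` would silence that warning, so leaving it out is what makes missing paths show up at build time.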

It’s not all perfectly clean yet; there are piles of UXA-only paths still.

### Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef’s in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231


The ‘legacy’ addition supports i810-class hardware, which is needed for a complete driver.

### Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven’t analyzed what effect this has on performance.

### Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It’s generally the fastest way of updating the screen as you don’t have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that’s pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

    fd = glamor_fd_from_pixmap(screen,
                               pixmap,
                               &stride,
                               &size);

    bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
    close(fd);
    intel_glamor_get_pixmap(pixmap)->bo = bo;


That last bit remembers the bo in some local memory so we don’t have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It’s all quite round-about, but it does seem to work just fine.

After I’d gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn’t working. That turned out to have an unexpected consequence: all full-screen applications would run flat-out, and not be limited to frame rate. Present ‘recovers’ from a failed flip queue operation by immediately performing a CopyArea, not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued and force them to fall back to CopyArea at the right time.

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

### Painting Root before Mode set

The X server has generally done initialization in one order:

1. Create root pixmap
2. Set video modes
3. Paint root window

Recently, we’ve added a ‘-background none’ option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can’t use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

1. Create root pixmap
2. Paint root window
3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we’re not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy — just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge — I needed to tell the DIX level RandR functions about the new modes; the mode setting operation used during server init doesn’t call up into RandR as RandR lists the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.

### Performance

I ran the current glamor version of the intel driver with the master branch of the X server and there were not any huge differences since my last Glamor performance evaluation aside from GetImage. The reason is that UXA/Glamor never called Glamor’s image functions, and the UXA GetImage is pretty slow. Using Mesa’s image download turns out to have a huge performance benefit:

1. UXA/Glamor from April
2. Glamor from today

| 1 | 2 (ratio to 1) | Operation |
|---:|---:|---|
| 50700.0 | 56300.0 (1.110) | ShmGetImage 10x10 square |
| 12600.0 | 26200.0 (2.079) | ShmGetImage 100x100 square |
| 1840.0 | 4250.0 (2.310) | ShmGetImage 500x500 square |
| 3290.0 | 202.0 (0.061) | ShmGetImage XY 10x10 square |
| 36.5 | 170.0 (4.658) | ShmGetImage XY 100x100 square |
| 1.5 | 56.4 (37.600) | ShmGetImage XY 500x500 square |
| 49800.0 | 50200.0 (1.008) | GetImage 10x10 square |
| 5690.0 | 19300.0 (3.392) | GetImage 100x100 square |
| 609.0 | 1360.0 (2.233) | GetImage 500x500 square |
| 3100.0 | 206.0 (0.066) | GetImage XY 10x10 square |
| 36.4 | 183.0 (5.027) | GetImage XY 100x100 square |
| 1.5 | 55.4 (36.933) | GetImage XY 500x500 square |


Running UXA from today the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before?

1: UXA today
2: Glamor today

| 1 | 2 (ratio to 1) | Operation |
|---:|---:|---|
| 43200.0 | 56300.0 (1.303) | ShmGetImage 10x10 square |
| 2600.0 | 26200.0 (10.077) | ShmGetImage 100x100 square |
| 130.0 | 4250.0 (32.692) | ShmGetImage 500x500 square |
| 3260.0 | 202.0 (0.062) | ShmGetImage XY 10x10 square |
| 36.7 | 170.0 (4.632) | ShmGetImage XY 100x100 square |
| 1.5 | 56.4 (37.600) | ShmGetImage XY 500x500 square |
| 41700.0 | 50200.0 (1.204) | GetImage 10x10 square |
| 2520.0 | 19300.0 (7.659) | GetImage 100x100 square |
| 125.0 | 1360.0 (10.880) | GetImage 500x500 square |
| 3150.0 | 206.0 (0.065) | GetImage XY 10x10 square |
| 36.1 | 183.0 (5.069) | GetImage XY 100x100 square |
| 1.5 | 55.4 (36.933) | GetImage XY 500x500 square |
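
The number in parentheses is simply column 2 divided by column 1, so values above 1 mean Glamor is faster and values below 1 mean it is slower. As a quick check, this small Python sketch recomputes three of the ratios from the UXA-today comparison. The helper name `ratio` is my own, and the figures are copied from the table.

```python
# (operation, column 1 = UXA today, column 2 = Glamor today)
rows = [
    ("ShmGetImage 500x500 square", 130.0, 4250.0),
    ("ShmGetImage XY 10x10 square", 3260.0, 202.0),
    ("GetImage XY 500x500 square", 1.5, 55.4),
]


def ratio(before, after):
    """Return after/before rounded to three decimals."""
    return round(after / before, 3)


for name, uxa, glamor in rows:
    print(f"{ratio(uxa, glamor):8.3f}  {name}")
```

The three results match the table: 32.692, 0.062 and 36.933. The XY 10x10 cases are the only ones where Glamor comes out slower.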


Of course, this is all just x11perf, which doesn’t represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it’s nice to have this kind of speed up.

### Status

I’m running this on my crash box to get some performance numbers and continue testing it. I’ll switch my desktop over when I feel a bit more comfortable with how it’s working. But, I think it’s feature complete at this point.

### Where’s the Code

As usual, the code is in my personal repository. It’s on the ‘glamor’ branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor

 July 19, 2014

Hello,

As part of my Google Summer of Code project I implemented MP counters (for compute only) on nv50/tesla. This work follows the implementation of MP counters for nvc0/fermi I did last year.

Compute counters are used by OpenCL while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction between these two types of counters made by NVIDIA is arbitrary and won’t be present in my implementation. That’s why compute counters can also be used to give detailed information about OpenGL applications, such as the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context, while performance counters programmed through the PCOUNTER engine are global. An MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately, while a global counter reports activities regardless of the context that generates them.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface which only exposes compute counters. On nv50/tesla, CUPTI exposes 13 performance counters like instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure an MP counter we use the command stream like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one just reinitializes the counter. Then, to select the group of the MP counter, we have added a software method. To poll counters we use a notifier buffer object which is allocated along with a channel. This notifier allows the kernel and mesa to communicate. This approach has already been explained in my latest article.

To sum up, this prototype adds support for 13 performance counters on nv50/tesla. All of the code is available on my github account. If you are interested, you can take a look at the mesa and the nouveau code.

Have a good day.

 July 17, 2014

Two years ago, I got appointed as chairman of the openSUSE Board. I was very excited about this opportunity, especially as it allowed me to keep contributing to openSUSE, after having moved to work on the cloud a few months before. I remember how I wanted to find new ways to participate in the project, and this was just a fantastic match for that. I had been on the GNOME Foundation board for a long time, so I knew it would not always be easy or fun, but I also knew I would pretty much enjoy it. And I did.

Fast-forward to today: I still care deeply about the project and I'm still excited about what we do in the openSUSE board. However, a happy event coming in a couple of months means that I'll have much less time to dedicate to openSUSE (and other projects). Therefore I decided a couple of months ago that I would step down before the end of the summer, after we had prepared the plan for the transition. Not an easy decision, but the right one, I feel.

And here we are now, with the official news out: I'm no longer the chairman :-) (See also this thread) Of course I'll still stay around and contribute to openSUSE, no worries about that! But as mentioned above, I'll have less time for that as offline life will be more "busy".

openSUSE Board Chairman at oSC14

Since I mentioned that we were working on a transition... First, knowing the current board, I have no doubt everything will keep being pushed in the right direction. But on top of that, my good friend Richard Brown has been appointed as the new chairman. Richard knows the project pretty well and he has been on the board for some time now, so he is aware of everything that's going on. I've been able to watch his passion for the project, and that's why I'm 100% confident that he will rock!

Anandtech recently went all out on the ARM midgard architecture (Mali T series). This was quite astounding, as ARM MPD tends to be a pretty closed shop. The Anandtech coverage included an in-depth view of the Mali Midgard GPU, a (short) Q&A session with Jem Davies (the head honcho of ARM MPD, ARM's Media Processing Division, the part of ARM that develops the Mali and display and video engines) and a google hangout with Jem Davies a week later.

This set of articles does not seem like the sort of thing that ARM MPD would have initiated itself. Since both Imagination Technologies and NVidia did something similar months earlier, my feeling is that this was either initiated by Anand Lal Shimpi himself, or that this was requested by ARM marketing in response to the other articles.

Several interesting observations can be made from this though, especially from the answers (or sometimes, lack thereof) to the Q&A and google hangout sessions.

## Hiding behind Linaro.

First off, Mr Davies still does not see an open source driver as a worthwhile endeavour for ARM MPD, and this is a position that hasn't changed since I started the lima driver, when my former employer went and talked to ARM management. Rumour has it that most of ARM's engineers both in MPD and other departments would like this to be different, and that Mr Davies is mostly alone with his views, but that's currently just hearsay. He himself states that there are only business reasons against an open source driver for the mali.

To give some weight to this, Mr Davies stated that he contributed to the linux kernel, and I called him out on that one, as I couldn't find any mention of him in a kernel git tree. It seems however that his contributions are from during the Bitkeeper days, and that the author trail on those changes probably got lost. But, having contributed to a project at one point or another is, to me, not proof that one actively supports the idea of open source software; at best it proves that adding support to the kernel for a given ARM device or subsystem was simply necessary at one point.

Mr Davies also talked about how ARM is investing a lot in linaro, as proof of ARM's support of open source software. Linaro is a consortium to further linux on ARM, so by definition ARM plays a very big role in it. But it is not ARM MPD that drives linaro, it is ARM itself. So this is not proof of ARM MPD actively supporting open source software. Mr Davies did not claim differently, but this distinction should be made very clear in this context.

Then, linaro can be described as an industry consortium. For non-founding members of a consortium, such a construction is often used to park some less useful people while gaining the privilege to claim involvement as and when desired. The difference to other consortiums is that most of the members come from a deeply embedded background, where the word "open" never was spoken before, and, magically, simply by having joined linaro, those deeply embedded companies now feel like they successfully ticked the "open source" box on their marketing checklist. Several of linaro's members are still having severe difficulty conforming to the GPL, but they still proudly wear the linaro badge as proof of their open source...ness?

As a prominent member of the sunxi community, I am most familiar with Allwinner, a small Chinese designer of cheap SoCs. At the start of the year, we were seeing some solid signs of Allwinner opening up to our community directly. In March however, Allwinner joined linaro and people were hopeful that this meant that a new era of openness had started for Allwinner. As usual, I was the only cynical voice and I warned that this could mean that Allwinner now wouldn't see the need to further engage with us. Ever since, we haven't been able to reach our contacts inside Allwinner anymore, and even our requests for compliance with the GPL get ignored.

Linaro membership does not absolve a member of limited open source involvement or downright license violation, but for many members, this is exactly how it is used. Linaro seems to be a get-out-of-jail-free card for several of its members. Linaro membership does not need to prove anything; linaro membership even seems to have the opposite effect in several cases.

ARM driving linaro is simply no proof that ARM MPD supports open source software.

## The patent excuse.

I am amazed that people still attempt to use this as an argument against open source graphics drivers.

Usually this is combined with the claim that open source drivers are exposing too much of the inner workings of the hardware. But this logic in itself states that the hardware is the problem, not the software. The hardware itself might or might not have patent issues, and it is just a matter of time before the owner of said breached patents will come a-knocking. At best, an open source driver might speed up the discovery of said issues, but the driver itself never is the cause, as the problems will have been there all along.

One would actually think that the Anandtech article about the midgard architecture would reveal more about the hardware, and trigger more litigation, than the lima driver could ever do, especially given how neatly packaged an in depth anandtech article is. Yet ARM MPD seemed to have had no issue with exposing this much information in their marketing blitz.

I also do not believe that patents are such a big issue. If graphics hardware patents were such big business, you would expect that an industry expert in graphics, especially one who is a dab hand at reverse engineering, would be contacted all the time to help expose potential patent issues. Yet I never have been contacted, and I know of no-one who ever has been.

Similarly, the first bits of lima code were made available 2.5 years ago, with bits trickling out slowly (much to my regret), and there are still several unknowns today. If lima played any role in patent disputes, you would again expect that I would be asked to support those looking to assert their patents. Again, nothing.

GPU Patents are just an excuse, nothing more.

When I was at SuSE, we freed ATI for AMD, and we never did hear that excuse. AMD wanted a solid open source strategy for ATI as ATI was not playing ball after the merger, and the bad publicity was hurting server (CPU) sales. Once the decision was made to go the open source route, patents suddenly were not an issue anymore. We did however have to deal with IP issues (or actually, AMD did - we made very sure we didn't get anything that wasn't supposed to be free), such as HDCP and media decoding, which ATI was not at liberty to make public. Given the very heated war that ATI and Nvidia fought at the time, and the huge amount of revenue in this market, you would think that ATI would be a very likely candidate for patent litigation, yet this never stood in the way of an open source driver.

There is another reason as to why patents are that popular an excuse. The words "troll" and "legal wrangling" are often sprinkled around as well so that images of shady deals being made by lawyers in smoky backrooms usually come to mind. Yet we never get to hear the details of patent cases, as even Mr Davies himself states that ARM is not making details available of ongoing cases. I also do not know of any public details on cases that have been closed already (not that I have actively looked - feel free to enlighten me). Patents are a perfect blanket excuse where proof apparently does not seem to be required.

We open source developers are very much aware of the damage that software patents do, and this makes the patent weapon perfect for deployment against those who support open source software. But there is a difference between software patents and the patent cases that ARM potentially has to deal with on the Mali. Yet we seem to have made patents our own kryptonite, and are way too easily lulled into backing off at the first mention of the word patent.

Patents are a poor excuse, as there is no direct relationship between an open source driver and the patent litigation around the hardware.

## The Resources discussion.

For a hardware vendor (or IP provider), doing a free software driver is never free. A lot of developer time does need to be invested, and this is an ongoing commitment. So yes, a viable open source driver for the Mali will consume some amount of resources.

Mr Davies states that MPD would have to incur this cost on its own, as MPD seems to be a completely separate unit and that further investment can only come from profit made within this group. In light of that information, I must apologize for ever having treated ARM and ARM MPD as one and the same with respect to this topic. I will from now on make it very clear that it is ARM MPD, and ARM MPD alone, that doesn't want an open source mali driver.

I do believe that Mr Davies's cost versus gain calculations are too direct and do not allow for secondary effects.

I also believe that an ongoing refusal to support an open source strategy for Mali will reflect badly on the sale of ARM processors and other IP, especially with ARM now pushing into the server market and getting into intel territory. The actions of ARM MPD do affect ARM itself, and vice versa. Admittedly, not as much as with those vendors that more closely tie their in-house GPU to the rest of the system, but that's far from an absolute lack of shared dependency and responsibility.

## The Mali binary problem.

One person in the Q&A section asked why ARM isn't doing redistributable drivers like Nvidia does for the Tegra. Mr Davies answered that this was a good idea, and that linaro was doing something along those lines.

Today, ironically, I am the canonical source for mali-400 binaries. At the sunxi project, we got some binaries from the Cubietech people, built from code they received from Allwinner, and the legal terms they were under did not prevent them from releasing the built binaries to the public. Around them (or at least, using the binaries as a separate git module) I built a small make based installation system which integrates with ARM's open source memory manager (UMP) and even included a quick GLES test from the lima tests. I stopped just short of Debian packaging. The sunxi-mali repository, and the wiki tutorial that goes with it, now is used by many other projects (like for instance linux-rockchip) as their canonical source for (halfway usable) GPU support.

There are several severe problems with these binaries, which we have either fixed directly, have been working around or just have to live with. Direct fixes include adding missing library dependencies, and hollowing out a destructor function which made X complain. These are binary hacks. The xf86-video-fbturbo driver from Siarhei Siamashka works around the broken DRI2 buffer management, but it has to try to autodetect how to work around the issues, as it is differently broken on the different versions of the X11 binaries we have. Then there is the flaky coverage, as we only have binaries for a handful of kernel APIs, making it impossible to match them against all vendor provided SoC/device kernels. We also only have binaries for fbdev or X11, and sometimes for android, mostly for armhf, but not always... It's just one big mess, only slightly better than having nothing at all.

Much to our surprise, in October of last year, ARM MPD published a howto entry about setting up a working driver for mali midgard on the chromebook. It was a step in the right direction, but involved quite a bit of faff, and Connor Abbott (the brilliant teenager REing the mali shaders) had to go and pour things into a proper git repository so that it would be more immediately useful. Another bout of insane irony, as this laudable step in the right direction by ARM MPD ultimately left something to be desired.

ARM MPD is not like ATI, Nvidia, or even intel, qualcomm or broadcom. The Mali is built into many very different SoC families, and needs to be integrated with different display engines, 2D engines, media engines and memory/cache subsystems.

Even the distribution of drivers is different. From what I understand, mali drivers are handled as follows. The Mali licensees get the relevant and/or latest mali driver source code and access to some support from ARM MPD. The device makers, however, only rarely get their hands on source code themselves and usually have to make do with the binaries provided by the SoC vendor. Similarly, the device maker only rarely gets to deal with ARM MPD directly, and usually needs to deal with some proxy at the SoC vendor. This setup puts the responsibility of SoC integration squarely at the SoC vendor, and is well suited for the current mobile market: one image per device at release, and then almost no updates. But that market is changing with the likes of Cyanogenmod, and other markets are opening or are actively being opened by ARM, and those require a completely different mode of operation.

There is a gap in Mali driver support that ARM MPD's model of driver delivery does not cater for today, and ARM MPD knows about this. But MPD is going to be fighting an uphill battle to try to correct this properly.

## Binary solutions?

So how can ARM MPD try to tackle this problem?

Would ARM MPD keep the burden of making suitable binaries available solely with SoC vendors or device makers? Not likely as that is a pretty shaky affair that's actively hurting the mali ecosystem. SoCs for the mobile market have incredibly short lives, and SoC and device software support is so fragmented that these vendors would be responsible for backporting bugfixes to a very wide array of kernels and SoC versions. On top of that, those vendors would only support a limited subset of windowing systems, possibly even only android as this is their primary market. Then, they would have to set up the support infrastructure to appropriately deal with user queries and bug reports. Only very few vendors will end up even attempting to do this, and none are doing so today. In the end, any improvement at this end will bring no advantages to the mali brand or ARM MPD. If this path is kept, we will not move on from the abysmal situation we are in today, and the Mali will continue to be seen as a very fragmented product.

ARM MPD has little other option but to try to tackle this itself, directly, and it should do so more proactively than by hiding behind linaro. Unfortunately, to make any real headway here, this means providing binaries for every kernel driver interface, and the SoC vendor changes to those interfaces, on top of other bits of SoC specific integration. But this also means dealing with user support directly, and these users will of course spend half their time asking questions which should be aimed at the SoC vendor. How is ARM MPD going to convince SoC vendors to participate here? Or is ARM MPD going to maintain most of the SoC integration work themselves? Surely it will not keep the burden only at linaro, wasting the resources of the rest of ARM and of linaro partners?

ARM MPD just is in a totally different position than the ATIs and Nvidias of this world. Providing binaries that will satisfy a sufficient part of the need is going to be a huge drain on resources. Sure, MPD is not spending the same amount of resources on optimizing for specific setups and specific games like ATI or Nvidia are doing, but they will instead have to spend it on the different SoCs and devices out there. And that's before we start talking about different windowing infrastructure, beyond surfaceflinger, fbdev or X11. Think wayland, mir, even directFB, or any other special requirements that people tend to have for their embedded hardware.

At best, ARM MPD itself will manage to support surfaceflinger, fbdev and X11 on just a handful of popular devices. But how will ARM MPD know beforehand which devices are going to be popular? How will ARM MPD keep on making sure that the binaries match the available vendor or device kernel trees? Would MPD take the insane route of maintaining their own kernel repositories with a suitable mali kernel driver for those few chosen devices, and backporting changes from the real vendor trees instead? No way.

Attempting to solve this very MPD specific problem with only binaries, to any degree of success, is going to be a huge drain on MPD resources, and in the end, people will still not be satisfied. The problem will remain.

The only fitting solution is an open source driver. Of course, the Samsungs of this world will not ship their flagship phones with just an open source GPU driver in the next few years. But an open source driver will fundamentally solve the issues people currently have with Mali, the issues which fuel both the demand for fitting distributable binaries and for an open source driver. Only an open source driver can be flexible and cost-effective enough to fill that gap. Only an open source driver can get silicon vendors, device makers, solution creators and users chipping in, satisfying their own, very varied, needs.

## Change is coming.

The ARM world is rapidly changing. Hardware review sites, which used to only review PC hardware, are more and more taking notice of what is happening in the mobile space. Companies that are still mostly stuck in embedded thinking are having to more and more act like PC hardware makers. The lack of sufficiently broad driver support is becoming a real issue, and one that cannot be solved easily or cheaply with a quick binary fix, especially for those who sell no silicon of their own.

The Mali marketing show on Anandtech tells us that things are looking up. The market is forcing ARM MPD to be more open, and MPD has to either sink or swim. The next step was demonstrated by yours truly and some other very enterprising individuals, and now both Nvidia and Broadcom are going all the way. It is just a matter of time before ARM MPD has to follow, as they need this more than their more progressive competitors.

To finish off, at the end of the Q&A session, someone asked: "Would free drivers gives greater value to the shareholders of ARM?". After a quick braindump, I concluded "Does ARM's lack of free drivers hurt shareholder value?" But we really should be stating "To what extent does ARM's lack of free drivers hurt shareholder value?".
 July 16, 2014

Today I am very happy to announce the release of AppStream 0.7, the second-largest release (judging by commit number) after 0.6. AppStream 0.7 brings many new features for the specification, adds lots of good stuff to libappstream, introduces a new libappstream-qt library for Qt developers and, as always, fixes some bugs.

Unfortunately we broke the API/ABI of libappstream, so please adjust your code accordingly. Apart from that, any other changes are backwards-compatible. So, here is an overview of what’s new in AppStream 0.7:

### Specification changes

Distributors may now specify a new <languages/> tag in their distribution XML, providing information about the languages a component supports and the completion-percentage for the language. This allows software-centers to apply smart filtering on applications to highlight the ones which are available in the user’s native language.
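
To illustrate the kind of filtering this enables, here is a rough Python sketch that picks out components translated into a given language above some completion threshold. The XML snippet is made up for the example, and its exact layout (`<lang percentage="…">` children inside `<languages/>`) is an assumption of the sketch. The function name and the 50% default are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Made-up distribution metadata; the element/attribute layout is an
# assumption for illustration, not copied from the specification.
XML = """
<components>
  <component type="desktop">
    <id>foo.desktop</id>
    <languages>
      <lang percentage="96">de</lang>
      <lang percentage="40">fr</lang>
    </languages>
  </component>
  <component type="desktop">
    <id>bar.desktop</id>
    <languages>
      <lang percentage="100">fr</lang>
    </languages>
  </component>
</components>
"""


def components_for_language(xml_text, lang, min_percentage=50):
    """Return ids of components translated into `lang` to at least min_percentage."""
    root = ET.fromstring(xml_text)
    matches = []
    for component in root.iter("component"):
        for entry in component.iter("lang"):
            done = int(entry.get("percentage", "0"))
            if entry.text == lang and done >= min_percentage:
                matches.append(component.findtext("id"))
                break
    return matches


print(components_for_language(XML, "fr"))  # ['bar.desktop']
```

A software center would more likely use libappstream than parse the XML by hand. The sketch only shows what the new data makes possible.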

A new addon component type was added to represent software which is designed to be used together with a specific other application (think of a Firefox addon or GNOME-Shell extension). Software-center applications can group the addons together with their main application to provide an easy way for users to install additional functionality for existing applications.

The <provides/> tag gained a new dbus item-type to expose D-Bus interface names the component provides to the outside world. This means in future it will be possible to search for components providing a specific dbus service:

    appstream-index what-provides dbus org.freedesktop.PackageKit.desktop system

(if you are using the cli tool)

A <developer_name/> tag was added to the generic component definition to define the name of the component developer in a human-readable form. Possible values are, for example “The KDE Community”, “GNOME Developers” or even the developer’s full name. This value can be (optionally) translated and will be displayed in software-centers.

An <update_contact/> tag was added to the specification, to provide a convenient way for distributors to reach upstream to talk about changes made to their metadata or issues with the latest software update. This tag was already used by some projects before, and has now been added to the official specification.

Timestamps in <release/> tags must now be UNIX epochs; YYYYMMDD is no longer valid (fortunately, everyone is already using UNIX epochs).

Last but not least, the <pkgname/> tag is now allowed multiple times per component. We still recommend creating metapackages according to the contents the upstream metadata describes and placing the file there. However, in some cases defining one component to be in multiple packages is a short way to make metadata available correctly without excessive package-tuning (which can become difficult if a <provides/> tag needs to be satisfied).

As a small sidenote: the multiarch path in /usr/share/appdata is now deprecated, because we think that we can live without it (by shipping -data packages per library and using smarter AppStream metadata generators which take advantage of the ability to define multiple <pkgname/> tags).

### Documentation updates

In general, the documentation of the specification has been reworked to be easier to understand and to include less duplication of information. We now use extensive crosslinking to show you the information you need in order to write metadata for your upstream project, or to implement a metadata generator for your distribution.

Because the specification needs to define the allowed tags completely and contain as much information as possible, it is not very easy to digest for upstream authors who just want some metadata shipped quickly. In order to help them, we now have “Quickstart pages” in the documentation, which are rich in examples and contain the most important subset of information you need to write a good metadata file. These quickstart guides already exist for desktop-applications and addons; more will follow in future.

We also have an explicit section dealing with the question “How do I translate upstream metadata?” now.

More changes to the docs are planned for the next point releases.

You can find the full project documentation at Freedesktop.

### AppStream GObject library and tools

The libappstream library also received lots of changes. The most important one: We switched from using LGPL-3+ to LGPL-2.1+. People who know me know that I love the v3 license family of GPL licenses – I like it for tivoization protection, its explicit compatibility with some important other licenses and cosmetic details, like entities not losing their right to use the software forever after a license violation. However, an LGPL-3+ library does not mix well with projects licensed under other open source licenses, mainly GPL-2-only projects. I want libappstream to be used by anyone without forcing the project to change its license. For some reason, using the library from proprietary code is easier than using it from a GPL-2-only open source project. The license change was also a popular request of people wanting to use the library, so I made the switch with 0.7. If you want to know more about the LGPL-3 issues, I recommend reading this blogpost by Nikos (GnuTLS).

On the code-side, libappstream received a large pile of bugfixes and some internal restructuring.
This makes the cache builder about 5% faster (depending on your system and the amount of metadata which needs to be processed) and prepares for future changes (e.g. I plan to obsolete PackageKit’s desktop-file-database in the long term).

The library also brings back support for legacy AppData files, which it can now read. However, appstream-validate will not validate these files (and will kindly ask you to migrate to the new format).

The appstream-index tool received some changes, making its command-line interface a bit more modern. It is also possible now to place the Xapian cache at arbitrary locations, which is a nice feature for developers.

Additionally, the testsuite got improved and should now work on systems which do not have metadata installed.

Of course, libappstream also implements all features of the new 0.7 specification.

With the 0.7 release, some symbols were removed which have been deprecated for a few releases, most notably as_component_get/set_idname, as_database_find_components_by_str, as_component_get/set_homepage and the “pkgname” property of AsComponent (which is now a string array and called “pkgnames”). API level was bumped to 1.

### Appstream-Qt

A Qt library to access AppStream data has been added. So if you want to use AppStream metadata in your Qt application, you can easily do that now without touching any GLib/GObject based code!

Special thanks to Sune Vuorela for his nice rework of the Qt library!

And that’s it with the changes for now! Thanks to everyone who helped make 0.7 ready, be it through feedback, contributions to the documentation, translation or coding. You can get the release tarballs at Freedesktop. Have fun!

 July 14, 2014

Following Christian's Wayland in Fedora Update post, and after Hans fixed the touchpad acceleration, I've been playing with pointer acceleration in libinput a bit. The main focus was not yet on changing it but rather on figuring out what we actually do and where the room for improvement is.
There's a tool in my (rather messy) github wip/ptraccel-work branch to re-generate the graphs below.

This was triggered by a simple plan: I want a configuration interface in libinput that provides a sliding scale from -1 to 1 to adjust a device's virtual speed from slowest to fastest, with 0 being the default for that device. A user should not have to worry about the accel mechanism itself, which may be different for any given device, all they need to know is that the setting -0.5 means "halfway between default and 'holy cow this moves like molasses!'". The utopia is of course that for any given acceleration setting, every device feels equally fast (or slow). In order to do that, I needed the right knobs to tweak.

The code we currently have in libinput is pretty much 1:1 what's used in the X server. The X server sports a lot more configuration options, but what we have in libinput 0.4.0 is essentially what the default acceleration settings are in X. Armed with the knowledge that any #define is a potential knob for configuration I went to investigate. There are two defines that are labelled as adjustable parameters:

• DEFAULT_THRESHOLD, set to 0.4
• DEFAULT_ACCELERATION, set to 2.0

But what do they mean, exactly? And what exactly does a value of 0.4 represent?

[side-note: threshold was 4 until I took the constant multiplier out, it's now 0.4 upstream and all the graphs represent that.]

Pointer acceleration is nothing more than mapping some input data to some potentially faster output data. How much faster depends on how fast the device moves, and to get there one usually needs a couple of steps. The trick of course is to make it predictable, so that despite the acceleration, your brain thinks that the visible cursor is an extension of your hand at all speeds.
Let's look at a high-level outline of our pointer acceleration code:

• calculate the velocity of the current movement
• use that velocity to calculate the acceleration factor
• apply accel to dx/dy
• smoothen out the dx/dy to avoid abrupt changes between two events

## Calculating pointer speed

We don't just use dx/dy as values, rather, we use the pointer velocity. There's a simple reason for that: dx/dy depends on the device's poll rate (or interrupt frequency). A device that polls twice as often sends half the dx/dy values in each event for the same physical speed.

Calculating the velocity is easy: divide dx/dy by the delta time. We use a set of "trackers" that store previous dx/dy values with their timestamp. As long as we get movement in the same cardinal direction, we take those into account. So if we have 5 events in direction NE, the speed is averaged over those 5 events, smoothing out abrupt speed changes.

## The acceleration function

The speed we just calculated is passed to the acceleration function to calculate an acceleration factor.

Figure 1: Mapping of velocity in unit/ms to acceleration factor (unitless). X axes here are labelled in units/ms and mm/s.

This function is the only place where DEFAULT_THRESHOLD/DEFAULT_ACCELERATION are used, but they mostly just stretch the graph. The shape stays the same.

The output of this function is a unit-less acceleration factor that is applied to dx/dy. A factor of 1 means leaving dx/dy untouched, 0.5 is half-speed, 2 is double-speed.

Let's look at the graph for the accel factor output (red): for very slow speeds we have an acceleration factor < 1.0, i.e. we're slowing things down. There is a distinct plateau up to the threshold of 0.4, after that it shoots up to roughly a factor of 1.6 where it flattens out a bit until we hit the max acceleration factor.

Now we can also put units to the two defaults: Threshold is clearly in units/ms, and the acceleration factor is simply a maximum.
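The first and third steps of the outline above can be sketched in a few lines of C. This is a simplified illustration, not libinput's code: the struct and function names are made up for this example, the velocity is taken per axis from a single event rather than averaged over several trackers, and the factor is simply passed in rather than computed from the acceleration function.

```c
#include <assert.h>
#include <stdint.h>

/* One relative motion event in device units (hypothetical type). */
struct motion {
	double dx, dy;
};

/* Per-axis velocity in device units per millisecond (hypothetical type). */
struct velocity {
	double vx, vy;
};

/* Divide dx/dy by the time since the previous event. */
static struct velocity
calc_velocity(struct motion m, uint64_t dt_ms)
{
	struct velocity v;

	assert(dt_ms > 0);
	v.vx = m.dx / (double)dt_ms;
	v.vy = m.dy / (double)dt_ms;
	return v;
}

/* Apply a unitless factor: 1 leaves dx/dy untouched, 0.5 halves, 2 doubles. */
static struct motion
apply_accel(struct motion m, double factor)
{
	struct motion out = { m.dx * factor, m.dy * factor };

	return out;
}
```

A device that reports 8 units every 8 ms and one that reports 4 units every 4 ms both come out at 1 unit/ms here, which is the poll-rate independence the velocity is meant to provide.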
Whether a threshold in units/ms and a unitless maximum factor are mentally easy to map is a different question.

We don't use the output of the function as-is, rather we smooth it out using Simpson's rule. The second (green) curve shows the accel factor after the smoothing took effect. This is a contrived example, the tool to generate this data simply increased the velocity, hence this particular line. For more random data, see Figure 2.

Figure 2: Mapping of velocity in unit/ms to acceleration factor (unitless) for a random data set. X axes here are labelled in units/ms and mm/s.

For the data set, I recorded the velocity from libinput while using Firefox a bit.

The smoothing takes history into account, so the data points we get depend on the usage. In this data set (and others I tested) we see that the majority of the points still lie on or close to the pure function, apparently the delta doesn't matter that much. Nonetheless, there are a few points that suggest that the smoothing does take effect in some cases.

It's important to note that this is already the second smoothing to take effect - remember that the velocity (may) average over multiple events and thus smoothens the input data. However, the two smoothing effects somewhat complement each other: velocity smoothing only happens when the pointer moves consistently without much change, the Simpson's smoothing effect is most pronounced when the pointer moves erratically.

Ok, now we have the basic function, let's look at the effect.

## Pointer speed mappings

Figure 3: Mapping raw unaccelerated dx to accelerated dx, in mm/s assuming a constant physical device resolution of 400 dpi that sends events at 125Hz. dx range mapped is 0..127

The graph was produced by sending 30 events with the same constant speed, then dividing by the number of events to reduce any effects tracker feeding has at the initial couple of events. The two lines show the actual output speed in mm/s and the gain in mm/s, i.e. (output speed - input speed).
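As a rough sketch of the Simpson's-rule smoothing mentioned above: Simpson's rule averages a function over an interval with weights 1, 4, 1 on the two end points and the midpoint. The code below assumes the interval runs from the previous velocity to the current one. The profile function is a hypothetical stand-in (flat up to the threshold, then a linear ramp capped at the maximum), not the curve from Figure 1, and all names are invented for this example.

```c
#include <assert.h>

#define SKETCH_THRESHOLD    0.4 /* units/ms, value of DEFAULT_THRESHOLD */
#define SKETCH_ACCELERATION 2.0 /* unitless, value of DEFAULT_ACCELERATION */

/* Hypothetical stand-in profile, NOT libinput's real curve. */
static double
sketch_profile(double velocity)
{
	double factor;

	assert(velocity >= 0.0);
	if (velocity <= SKETCH_THRESHOLD)
		return 1.0;

	factor = velocity / SKETCH_THRESHOLD;
	return factor < SKETCH_ACCELERATION ? factor : SKETCH_ACCELERATION;
}

/* Simpson's rule: weights 1, 4, 1 over previous, midpoint, current. */
static double
smoothed_factor(double last_velocity, double velocity)
{
	double mid = (last_velocity + velocity) / 2.0;

	return (sketch_profile(last_velocity) +
		4.0 * sketch_profile(mid) +
		sketch_profile(velocity)) / 6.0;
}
```

With a constant velocity the smoothed factor equals the plain profile output, which matches the observation that most points sit on the pure function; only a change in velocity between two events pulls the result away from it.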
In Figure 3 we can see the little nook where the threshold kicks in; after that, the acceleration is linear. Look at Figure 1 again: the linear acceleration is caused by the acceleration factor maxing out quickly.

Most of this graph is theoretical only though. On your average mouse you don't usually get a delta greater than 10 or 15 and this graph covers the theoretical range to 127. So you'd only ever be seeing the effect of up to ~120 mm/s. So a more realistic view of the graph is:

Figure 4: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event).

Same data as Figure 3, but zoomed to the realistic range. We go from a linear speed increase (no acceleration) to a quick bump once the threshold is hit and from then on to a linear speed increase once the maximum acceleration is hit.

And to verify, the ratio of output speed : input speed:

Figure 5: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated.

Looks pretty much exactly like the pure acceleration function, which is to be expected. What's important here though is that this is the effective speed, not some mathematical abstraction. And it shows one limitation: we go from 0 to full acceleration within a really small window.

Again, this is the full theoretical range, the more realistic range is:

Figure 6: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated. Zoomed in to a max of 120 mm/s (15 dx/event).

Same data as Figure 5, just zoomed in to a maximum of 120 mm/s. If we assume that 15 dx/event is roughly the maximum you can reach with a mouse you'll see that we've reached maximum acceleration at a third of the maximum speed and the window where we have adaptive acceleration is tiny.

Tweaking threshold/accel doesn't do that much.
Below are the two graphs representing the default (threshold=0.4, accel=2), a doubled threshold (threshold=0.8, accel=2) and a doubled acceleration (threshold=0.4, accel=4).

Figure 6: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.

Figure 7: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, see Figure 5 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.

Doubling either setting just moves the adaptive window around, it doesn't change that much in the grand scheme of things.

Now, of course these were all fairly simple examples with constant speed, etc. Let's look at a diagram of what is essentially random movement, me clicking around in Firefox for a bit:

Figure 8: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set.

And the zoomed-in version of this:

Figure 9: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set, zoomed in to events 450-550 of that set.

This is more-or-less random movement reflecting some real-world usage. What I find interesting is that it's very hard to see any areas where smoothing takes visible effect. The accelerated curve largely looks like a stretched input curve. tbh I'm not sure what I should've expected here and how to read that, pointer acceleration data in real-world usage is notoriously hard to visualize.

## Summary

So in summary: I think there is room for improvement. We have no acceleration up to the threshold, then we accelerate within too small a window. Acceleration stops adjusting to the speed soon. This makes us lose precision and small speed changes are punished quickly.

Increasing the threshold or the acceleration factor doesn't do that much. Any increase in acceleration makes the mouse faster but the adaptive window stays small.
Any increase in threshold makes the acceleration kick in later, but the adaptive window stays small. We've already merged a number of fixes into libinput, but some more work is needed. I think that to get a good pointer acceleration we need to get a larger adaptive window [Citation needed]. We're currently working on that (and figuring out how to evaluate whatever changes we come up with). ## A word on units The biggest issue I was struggling with when trying to understand the code was that of units. The code didn't document used units anywhere but it turns out that everything was either in device units ("mickeys"), device units/ms or (in the case of the acceleration factors) was unitless. Device units are unfortunately a pretty useless base entity, only slightly more precise than using the length of a piece of string. A device unit depends on the device resolution and of course that differs between devices. An average USB mouse tends to have 400 dpi (15.75 units/mm) but it's common to have 800 dpi, 1000 dpi and gaming mice go up to 8200dpi. A touchpad can have resolutions of 1092 dpi (43 u/mm), 3277 dpi (129 u/mm), etc. and may even have different resolutions for x and y. This explains why until commit e874d09b4 the touchpad felt slower than a "normal" mouse. We scaled to a magic constant of 10 units/mm, before hitting the pointer acceleration code. Now, as said above the mouse would likely have a resolution of 15.75 units/mm, making it roughly 50% faster. The acceleration would kick in earlier on the mouse, giving the touchpad and the mouse not only different speeds but a different feel altogether. Unfortunately, there is not much we can do about mice feeling different depending on the resolution. To my knowledge there is no way to query the resolution on a device. But for absolute devices that need pointer acceleration (i.e. touchpads) we can normalize to a fake resolution of 400 dpi and base the acceleration code on that. 
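The unit conversions used throughout this post can be written down as a small C sketch. The function names and the `NORMALIZED_DPI` constant are invented for this example and are not libinput API; the only facts used are 25.4 mm per inch and the 400 dpi / 125 Hz assumptions stated for the figures.

```c
#include <assert.h>

#define MM_PER_INCH    25.4
#define NORMALIZED_DPI 400.0 /* the "fake" resolution mentioned above */

/* Device units per millimetre for a given resolution. */
static double
units_per_mm(double dpi)
{
	assert(dpi > 0.0);
	return dpi / MM_PER_INCH;
}

/* Physical speed in mm/s for a constant dx per event at a given event rate. */
static double
speed_mm_per_s(double dx_per_event, double event_rate_hz, double dpi)
{
	return dx_per_event * event_rate_hz / units_per_mm(dpi);
}

/* Scale a delta from a device's own resolution to the normalized one. */
static double
normalize_delta(double delta, double device_dpi)
{
	assert(device_dpi > 0.0);
	return delta * NORMALIZED_DPI / device_dpi;
}
```

For example, `units_per_mm(400.0)` is about 15.75, and `speed_mm_per_s(15.0, 125.0, 400.0)` is about 119 mm/s, which is where the "120 mm/s (15 dx/event)" zoom range in the figures comes from. A touchpad reporting at 1092 dpi would have its deltas scaled by 400/1092 before they reach the acceleration code.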
Normalizing this way provides the same feel on the mouse and the touchpad, as much as that is possible anyway.

 July 13, 2014

• EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long lost the SVG, and it got kind of messed up, but it’s there at the bottom.
• EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
• EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
• EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it.

True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that supports page tables and address spaces. It’s called “true” because the Aliasing PPGTT was introduced first and often was simply called “PPGTT.”

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn’t realistically be enabled in isolation of the existing driver. When regressions occur it’s likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and to help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2.
I highly recommend that the stability patches be used unless you’re reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how hardware works. There won’t be PRM references, or hardware state machines. All of those mechanics have been described in part 1 and part 2.

# A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two, yellow, orange and blue in the last) that denotes the scope of a GPU context/process with the specified feature. Incrementally the definition of a process begins to bleed between the CPU, and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

## File Descriptors

Initially all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering will get a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate “who” was doing “what.” This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement, it’s just an idr in the kernel) there exists no mechanism to reference buffer handles from a different file descriptor.
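To make the per-file-descriptor handle namespace concrete, here is a toy userspace model in C. It is not the kernel's idr and none of these names exist in the driver; it only illustrates that a handle is an index into a table owned by one client, so the same number means different things to different clients. Treating handle 0 as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16

/* Stand-in for a buffer object (hypothetical). */
struct bo {
	size_t size;
};

/* One handle table per open file descriptor (hypothetical). */
struct client {
	struct bo *handles[MAX_HANDLES];
};

/* Hand out the lowest free handle; 0 means failure in this sketch. */
static int
handle_create(struct client *c, struct bo *bo)
{
	for (int h = 1; h < MAX_HANDLES; h++) {
		if (!c->handles[h]) {
			c->handles[h] = bo;
			return h;
		}
	}
	return 0;
}

/* A lookup only ever consults the caller's own table. */
static struct bo *
handle_lookup(const struct client *c, int handle)
{
	if (handle <= 0 || handle >= MAX_HANDLES)
		return NULL;
	return c->handles[handle];
}
```

Two clients that each create one buffer both receive handle 1, yet each lookup resolves to that client's own buffer. As the post goes on to say, this is purely a software partition; nothing here stops a client from guessing an address.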
For applications which do not require saved context, non-buggy apps, or non-malicious apps, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic since each has a different file descriptor[1].

Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation. Before hardware contexts.

## Hardware Contexts

The next step towards isolation was the Hardware Context[2]. The hardware contexts built upon the isolation provided by the original file descriptor mechanism. The hardware context was an opt-in interface which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client[3]. There was quite a bit of discussion around this at the time the patches were in review, and there’s not really any point in lamenting about how it could be better, now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains extremely simple.

    struct drm_i915_gem_context_create {
        /* output: id of new context*/
        __u32 ctx_id;
        __u32 pad;
    };

    struct drm_i915_gem_context_destroy {
        __u32 ctx_id;
        __u32 pad;
    };

As you can see from the two IOCTL payloads above, I wasn’t lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn’t a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not.
The primary motivation for destroy() is simply to allow very meticulous and memory conscious GPU clients to keep things tidy.

Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, this takes one of those off the list.

• GPU clients needed HW context preserved (addressed by hardware contexts)
• Buggy applications writing to random memory
• Malicious applications

The block diagram is quite similar to above diagram with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

Hardware context isolation

## Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolation

If I write about this picture, there would be no point in continuing with an organized blog post :-). So I’ll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients:

• GPU clients needed HW context preserved (addressed by hardware contexts)
• Buggy applications writing to random memory (addressed by full PPGTT)
• Malicious applications (addressed by full PPGTT)

Since the GGTT isn’t really mentioned much in this post, I’d like to point out that the GTT still exists as you can see in this diagram. It is required for several components that were listed in my previous blog post.

# VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as VMA[4]. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, continuous range within an address space. Conceptually there isn’t much difference between a VMA and a GEM BO. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain.
A VMA is uniquely identified via the tuple (BO, Address space). In the likely case that I made no sense just there, a VMA is just another handle on a chunk of GPU memory used for rendering.

## Sharing VMAs

You can’t (see the note at the bottom). There’s not a whole lot I can say without doing another post about DMA-Buf, and/or Flink. Perhaps someday I will, but for now I’ll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space, and a BO. It remains possible to share a BO. An address space exists for an individual GPU client’s process. Therefore it makes no sense to share a VMA since the address space cannot be shared[5]. As a result of using the existing sharing interfaces a GPU will get multiple VMAs that reference the same BO. Trying to go back to the math jargon again:

1. VMA: (BO, Address Space) // Some BO mapped by the address space.
2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.

M = {1,2,3,…} N = {1,2,3,…}

In case it’s still unclear, I’ll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMA_global. Jumping ahead, a GPU client cannot have a global mapping, therefore, to render to the frontbuffer it too has a VMA, VMA_pp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can’t think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.

## Data Structures

Here are the relevant data structures cropped for the sake of brevity.
    struct i915_address_space {
        struct drm_mm mm;
        unsigned long start; /* Start offset always 0 for dri2 */
        size_t total; /* size addr space maps (ex. 2GB for ggtt) */
        struct list_head active_list;
        struct list_head inactive_list;
    };

    struct i915_hw_ppgtt {
        struct i915_address_space base;
        int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
                         struct intel_engine_cs *ring,
                         bool synchronous);
    };

    struct i915_vma {
        struct drm_mm_node node;
        struct drm_i915_gem_object *obj;
        struct i915_address_space *vm;
    };

The struct i915_hw_ppgtt is a subclass of a struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+ but I’ve not opted to do this. I feel there is too much duplication for not enough benefit.

I’ve already explained in different words that a range of used address space is the VMA. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node because this is the used part of the address space[6]. In the i915_vma struct above is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that defines the VMA.

    HOLE   0x0->0x64000
    VMA 1  0x64000->0x69000
    HOLE   0x69000->512M
    VMA 2  512M->512.004M
    HOLE   ~512M->2GB

    Allocated space: 0x6000
    Free space: 0x7fffa000

## Relation to the Hardware Context

    struct intel_context {
        struct kref ref;
        int id;
        ...
        struct i915_address_space *vm;
    };

With the 3 elements discussed a few times already: file descriptor, context, PPGTT, we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this was not done, then innocent GPU clients could feel the wrath.
The file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), so it made sense to hook off of that to create the contexts and PPGTTs.

## Implicit Context (“private default context”)

From here on out we can consider a “context” as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I’ve called this the Private Default Context as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context. Hardware state maintains its state from the previous render operation when using the IOCTLs.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I’ve submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, this above section will be false and we can just say – every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

## Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context savvy).

    struct drm_i915_gem_execbuffer2 {
        /**
         * List of gem_exec_object2 structs
         */
        __u64 buffers_ptr;
        __u32 buffer_count;

        /** Offset in the batchbuffer to start execution from. */
        __u32 batch_start_offset;
        /** Bytes used in batchbuffer from batch_start_offset */
        __u32 batch_len;
        ...
        __u64 flags;
        __u64 rsvd1; /* now used for context info */
        __u64 rsvd2;
    };

A process may wish to create several GL contexts. The API allows this, and for reasons I don’t understand, it’s something some applications wish to do. If there was no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context or else they would not reap the benefits of everything we’ve discussed for a GL context.

## The Big Picture – literally

Overview

## Context:PPGTT

One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context. Quoting myself from one of my earlier public declarations, here:

“My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed.”

My idea was to embed the PPGTT within the context structure, and creating a context always resulted in a new PPGTT. Creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I’m still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that this is a requirement for share groups. Fundamentally this would allow the client to create multiple GPU contexts that share an address space (it resembles what you’d get back when there was only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally however, my proposal matches what is in use currently. I think it’s a bit easier to think of things this way too.

Current Mesa

Current DDX

2 hypothetical scenarios

# Conclusion

Please feel free to send me issues or questions. Oh yeah. Here is a state machine that I did for a presentation on this.
Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State Machine

## TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward, test it, report bugs, try to fix stuff, don’t yell at me when things break :-).

## Summary

That’s most of it. I like to give the 10 second summary.

1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
2. The GPU has a virtual address space per DRI file descriptor.
3. There is a connection between the PPGTT, and a Hardware Context.
4. VMAs are backed by BOs which are backed by physical pages.
5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

| Thing | Modern X86 CPU | Modern i915 GPU |
|---|---|---|
| Phys Address Limit | 48b? | ~40b |
| Process Isolation | Yes | Yes (with True PPGTT) |
| Virtual Address Space | Yes | Yes |
| 64b VA Space | Yes | GEN8+ 48b only |
| PTE access controls | Yes | No |
| Page Fault Handling | Yes | No |
| Preemption[7] | Yes | *With execlists |

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU – it still has a ways to go. I would be surprised if things didn’t continue going in this direction.

## SVG Links

As usual, please feel free to do something useful with the images I’ve created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg

1. It’s technically possible to make them be the same BO through the two buffer sharing mechanisms.
2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting, however it does not contribute to any part of the GPU “process”
3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted-in. This means if one GPU client does opt-in, and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow.
4. I would have preferred the reservation of a space within the address space be called a, “GVMA”, but that was shot down during review
5. There’s a whole section below describing how this statement could be false. For now, let’s pretend address spaces can’t be shared
6. For those unfamiliar with the Direct Render Manager memory manager, a drm_mm is the structure for the memory manager provided by the DRM midlayer helper layer. It does all the things you’d expect out of a memory manager like, find free nodes, allocate nodes, free up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees.
7. I am defining the word preemption as the ability to switch at an arbitrary point in time between contexts.
On the CPU this is easily accomplished. The GPU running the i915 driver as of today has no way to do this. Once a batch is running it cannot be interrupted except for RC6.

 July 12, 2014

EDIT1 (2014-07-12): Apologies to planets for update.

• Change b->B (bits to bytes) in the state walkthrough (thanks to Bernard Kilarski)
• Convert SVG images to PNG because they weren’t being rendered properly.
• Added TOC
• Use new style footnotes
• NOTE: With command parser merged, and execlists on the way – this post is already somewhat outdated.

Disclaimer: Everything documented below is included in the Intel public documentation. Anything I say which offends you are my own words and not those of Intel. Sadly, anything I say that is of monetary value are those of Intel.

# Intro

## Goal

My goal is to lay down a basic understanding of how GEN GPU execution works using gem_exec_nop from the intel-gpu-tools suite as an example. One who puts in the time to read this should understand how command submission works for the i915 driver, and how gem_exec_nop tests command submission. You should also have a decent idea of how the hardware handles execution. I intentionally skip topics like relocations, and how graphics virtual addresses are maintained. They are not directly related to execution, and would make the blog entry too long.

Ideally, I am hoping this will enable people who are interested to file better bugs, improve our tests, or write their own tests.

## Terminology

• i915: The name of the Linux kernel driver for Intel GEN graphics. i915 is the name of an ancient chipset that was one of the first supported by the driver. The driver itself supports chipsets both before, and after i915.
• BO: Buffer Object. GEM uses handles to identify the buffers used as graphics operands in order to avoid costly copies from userspace to kernel space. BO is the thing which is encapsulated by that handle.
• GEM: Graphics Execution Manager.
The name of a design and API to give userspace GPU clients the ability to execute work on a GPU (the API is technically not specific to GEN). • GEN: The name of the Graphics IP developed by Intel Corporation. • GPU client: A userspace application or library that submits GPU work. • Graphics [virtual] Address: Address space used by the GPU for mapping system memory. GEN is an UMA architecture with regard to the CPU. • NOP/NOOP: An assembly instruction mnemonic for a machine opcode that does no work. Note that this is not the same as a lack of work. The instruction is indeed executed, it simply has no side-effects. The execution latency is strictly greater than zero. • relocations: The way in which GEM manages to make GPU clients agnostic to where the buffers are actually mapped by the GPU. Out of scope for this blog entry. ## Source Code The source code in this post is found primarily in two places. Note that the links below are both from very fast moving code bases. ## GEN Hardware Before going over gem_exec_nop, I’d like to give an overview of modern GEN hardware: Coarse GEN block diagram. I don’t want to say this is the exhaustive list, and indeed, each block above has many sub-components. In the part of the driver I work on this is a pretty logical way to split it. Each of the blocks share very little. The common denominator is a Graphics Virtual Address which is understood by all blocks. This provides an easy communication for work needing to be sent between components. As an example, the command streamer might want the display engine to flip to a new surface. It does so by sending a special message to the display engine along with the address of the surface to flip to. The display engine may respond “out of band” via interrupts (flip completion). There are also built in synchronization primitives that allow the command streamer to wait on events sent by the display engine (we’ll get to the command streamer in more detail later). 
Excluding audio, since I know nothing about audio… by a very rough estimate, 85% of the Linux i915.ko code falls into “Other.” Of the remaining 15% in graphics processing engine, the kernel driver tends to utilize very little of the Fixed Func/EU block above. Total lines of code outside of the kernel driver for the EU block is enormous, given that the X 2d driver (DDX), mesa, libva, and beignet all have tons of lines of code just for utilizing that part of the hardware. # gem_exec_nop gem_exec_nop is one of my favorite tests. For me, it’s the first test I run to determine whether or not to even bother with the rest of the test suite. • It’s dead simple. • It’s fast. • It tests a surprisingly large amount of the hardware, and software. • Gives some indication of performance • It’s deader than dead simple It’s not a perfect test, some of the things which are missing: • Handling under memory pressure (relocs, swaps, etc.) • Tiling formats • Explicit testing of cacheability types, and coherency (LLC et al.) • several GEM interfaces • The aforementioned 85% of the driver • It doesn’t even execute a NOP instruction!!! ## gem_exec_nop flowchart NOTE: I will explain more about what a batchbuffer is later. * (step 1) The docs say we must always follow MI_BATCH_BUFFER_END with an MI_NOOP. The presumed reason for this is that the hardware may prefetch the next instruction, and I think the designers wanted to dumb down the fact that they can't handle a pagefault on the prefetch, so they simply demand a MI_NOOP.  ** (step1) MI_NOOP is defined as a dword of value 0x00000000. GEM BOs are 0 by default. So we have an implicit buffer full of MI_NOOP. 1. Creating a batchbuffer is done using GEM APIs. Here we create a batchbuffer of size 4096, and fill in two instructions. The batchbuffer is the basic unit of execution. The only pertinent point to keep in mind is this is the only buffer being created for this test. Note that this step, or a similar one, is done in almost every test. 
2. Here we set up the data structure that will be passed to the kernel in an IOCTL. There’s a pointer to the list of buffers, in our case just the one batchbuffer created in step 1. The size of 8 (each of the two instructions is 4 bytes), and some flags which we’ll skip for now, are also included in the struct.
3. The dotted line through step 3 denotes the userspace/kernel barrier. Above the line is gem_exec_nop.c, below is i915_gem_execbuffer.c. DRM, which is a common subsystem interface, actually dispatches the IOCTLs to the i915 driver.
4. The kernel receives and handles the data. Talked about in more detail later.
5. Submit to the GPU for execution. Also detailed later.

# Execbuf2 IOCTL and Hardware Execution
## i915.ko execbuffer2 handling (step 4 and 5 in the picture above)
The eventual goal of the kernel driver is to take the batchbuffer passed in from userspace, make sure it is visible to the GPU by mapping it, and then submit it to the GPU for execution. The aforementioned operations are synchronous with respect to the IOCTL [1]. In other words, by the time the execution returns to the application, the GPU knows about the work. The work is completed asynchronously.

I’ll detail some of the steps a bit. Unfortunately, I do not have pretty pictures for this one. You can follow along in i915_gem_execbuffer.c; i915_gem_do_execbuffer()
1. copy_from_user – Copy the BOs in from userspace. Remember that the BO is a handle and not actual memory being copied; this allows a relatively small and fast copy to take place. In gem_exec_nop, there is exactly 1 BO: the batchbuffer.
2. some sanity checks – not interesting
3. look up – Do a lookup of all the handles for the BOs passed in via the buffers_ptr member (copied in during #1). Make sure the buffers still exist and so on. In our case this is only one buffer and it’s unlikely that it would be destroyed before execbuffer completes [2].
4.
Space reservation – Make sure there is enough address space in the GPU for the objects. This also includes checking for various alignment restrictions, and a few other details not really relevant to this specific topic. For our example, we’ll have to make sure we have enough space for 1 buffer of size 4096, and no special alignment requirements. It’s the second simplest request possible (first would be to have no buffers).
5. Relocations – save for another day.
6. Ring synchronization – Also not pertinent to gem_exec_nop. Since it involves the command streamer, I’ll include a brief description as a footnote [3].
7. Dispatch – Finally we can tell the GEN hardware about the work that we just got. This means using some architectural registers to point the hardware at the batchbuffer which was submitted by userspace. More on this shortly…
8. Some more relocation stuff – save for another day

## Execution part I (Command Streamer/Ringbuffer)
Fundamentally, all work is submitted via a hardware ringbuffer and fetched by the command streamer. A command streamer is many things, but for now, saying it’s a DMA engine for copying in commands and associated data is a good enough definition. The ringbuffer is a canonical ringbuffer with a HEAD and TAIL pointer (to be clear: TAIL is the one incremented by the CPU, and read by the GPU. HEAD is written by the GPU and read by the CPU). There is a third pointer known as ACTHD (or Active HEAD) – more on this later. At driver initialization, the space for the ringbuffer is allocated, and the address and size are written to hardware registers. When the driver wants to submit work, it writes data at the current TAIL pointer, and increments the TAIL pointer. Once the TAIL is incremented, the hardware will start reading in commands (via DMA), and increment the HEAD (and ACTHD) pointer as commands are retired.

Early GEN hardware had only 1 command streamer.
It was referred to as “CS.” When Ironlake introduced the VCS, or video engine command streamer, they renamed (in some places) the original CS to RCS, for render engine command streamer. Sandybridge introduced the blit engine command streamer, BCS, and Haswell the video enhancement command streamer, or VECS. Each command streamer supports its own instruction set, though many instructions are the same on multiple command streamers; MI_NOOP, for example, is supported on all of them.

Having multiple command streamers not only provides an easy way to add new instructions, but it also allows an asynchronous way to submit work, which can be very useful if you are trying to do two orthogonal tasks. As an example, take an OpenCL application running in conjunction with your favorite 3d benchmark. The 3d benchmark internally will only use the 3d and blit hardware, while the OCL application will use the GPGPU hardware. It doesn’t make sense to have either one wait for a single command streamer to fetch the data (especially since I glossed over some other details which make it an even worse idea) if there won’t be any [or few] data dependencies.

The kernel driver is the only entity which can insert commands into the ringbuffer. The ringbuffer is therefore considered trusted, and all commands supported by the hardware may be run here (the docs use the word “secure”, but this gets confusing quickly). The way in which the batchbuffer we created in gem_exec_nop gets executed will be explained a bit further shortly, but the contents of that batchbuffer are not directly inserted into the ringbuffer [4]. Take a quick peek at the text in the image below for how it works.

Here is a pretty basic picture describing the above. The HEAD and TAIL point to the next instruction to be executed, therefore this would be midway through step #5 in the flowchart above.
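The HEAD/TAIL protocol described in this section can be sketched as a small software model. This is only an illustration under simplifying assumptions, not how i915 or the hardware implements it: the class name `RingModel`, the 64-byte ring size and the method names are all made up for the example, and the two opcode encodings are assumed to be `opcode << 23` as I recall them from the i915 headers.

```python
# Assumed encodings (opcode << 23); treat them as illustrative placeholders.
MI_NOOP = 0x00000000
MI_BATCH_BUFFER_START = 0x31 << 23


class RingModel:
    """Toy ring: the CPU writes commands and bumps TAIL, the GPU bumps HEAD."""

    def __init__(self, size_bytes=64):
        assert size_bytes % 4 == 0
        self.size = size_bytes
        self.mem = [MI_NOOP] * (size_bytes // 4)  # one slot per dword
        self.head = 0  # byte offset, advanced by the "GPU"
        self.tail = 0  # byte offset, advanced by the "CPU"

    def idle(self):
        # HEAD == TAIL means there is nothing left to fetch.
        return self.head == self.tail

    def space(self):
        # Keep one dword free so a full ring never looks like an empty one.
        return (self.head - self.tail - 4) % self.size

    def emit(self, dwords):
        """CPU side: write commands at TAIL, then publish the new TAIL."""
        if len(dwords) * 4 > self.space():
            raise BufferError("ring full")
        offset = self.tail
        for dword in dwords:
            self.mem[offset // 4] = dword
            offset = (offset + 4) % self.size  # wrap at the end of the ring
        self.tail = offset

    def consume(self):
        """GPU side: fetch everything between HEAD and TAIL and retire it."""
        fetched = []
        while self.head != self.tail:
            fetched.append(self.mem[self.head // 4])
            self.head = (self.head + 4) % self.size
        return fetched


ring = RingModel()
ring.emit([MI_BATCH_BUFFER_START, 0x22000])  # command dword + batch address
retired = ring.consume()
```

After `emit()` the model is busy (HEAD != TAIL); after `consume()` it is idle again with HEAD caught up to TAIL, which mirrors the description above.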
## Execution part II (MI_BATCH_BUFFER_START, batchbuffer)
A batchbuffer is the way in which we can submit work to the GPU without having to write into the hardware ringbuffer (since only the kernel driver can do that). A batchbuffer is submitted to the GPU for execution via a command called MI_BATCH_BUFFER_START which is inserted into the ringbuffer and read by the command streamer. Batchbuffers share an instruction set with the command streamer that dispatched them (i.e. batches run by the blit engine can issue blit commands), and the execution flow is very similar to that of the command streamer as described in the first diagram, and subsequently.

On the other hand, there are quite a few differences. Batchbuffer execution is not guided by HEAD and TAIL pointers. The hardware will continue to execute every instruction in a batchbuffer until it hits another MI_BATCH_BUFFER_START command, or an MI_BATCH_BUFFER_END. Yes, you can get into an infinite loop of batchbuffers with this nesting of MI_BATCH_BUFFER_START commands. The hardware has an internal HEAD pointer which is exposed for debug purposes called ACTHD. This pointer works exactly like a HEAD pointer would, except it is never compared against TAIL to determine the end of execution [5]. MI_BATCH_BUFFER_END will directly guide execution back to the hardware ringbuffer. In other words you need only one MI_BATCH_BUFFER_END to break the chain of n MI_BATCH_BUFFER_STARTs.

Getting back to gem_exec_nop specifically for a sec: this is what we set up in step #1. Recall it had 2 instructions: MI_BATCH_BUFFER_END, MI_NOOP.

Here is our graphical representation of the batchbuffer from gem_exec_nop. Notice that the batchbuffer doesn’t have a tail pointer, only ACTHD.

## Hardware states
The following macro-level state machine/flowchart hybrid can be used to describe both ringbuffer execution and batchbuffer execution, though the descriptions differ slightly.
By “macro-level” I mean each state may not match exactly to a state within the hardware’s state machines. It’s more of a state in the data flow. The “state machines” for both ringbuffers and batchbuffers are pretty similar. What follows is a diagram that mostly works for both, and a description of each state. I’ll use “RSn” for ringbuffer state n, and “BSn” for batchbuffer state n. • RS0: Idle state, HEAD == TAIL. Waiting for driver to increment tail. • RS1: TAIL has changed. Fetch some amount between HEAD and TAIL (I’d guess it fetches the whole thing since the ringbuffer size is strictly limited). • RS2: Fetch has completed, and command parsing can begin. Command parsing here is relatively easy. Every command is 4B aligned, and has the total command length embedded in the first 4th (1 based) byte of the opcode. Once it has determined the length, it can send that many dwords to the next stage. • RS3: 1 command has been parsed and sent to be executed (pun intended). • RS4: Execute phase required some more work, if the command executed in RS3 requires some extra data, now is when it will get fetched – and AFAICT, the hardware will stall waiting for the fetch to complete. If there is nothing left to do for the command, HEAD is incremented. Most commands will be done and increment HEAD. MI_BATCH_BUFFER_START is a common exception. I wish I could easily change the image… this is really RS3.5. • RS5: An error state requiring a GPU reset. • BS0: ASSERT(last command != MI_BATCH_BUFFER_END) This isn’t a real state. While executing a batchbuffer, you’re never idle. We can use this state as a place to update ACTHD though, so let’s say ACTHD := batchbuffer start address. • BS1: Similar to RS1, fetch the data. Hopefully most of it exists in some internal cache since we had to fetch some amount of it in RS4, but I don’t claim to know the micro-architecture details on this. 
• BS2: Just like RS2
• BS3: Just like RS3
• BS4: Just like RS4

## gem_exec_nop state walkthrough
With the above knowledge, we can now step through the actual stuff from gem_exec_nop. This combines pretty much all the diagrams above (i.e. you might want to reference them). I tried to keep everything factually correct along the way, minus the addresses I make up below.

Assume HEAD = 0x30, TAIL = 0x30, ACTHD = 0x30
1. Hardware is in RS0.
2. gem_exec_nop runs; submits previously discussed setup to i915.
3. *** kernel picks address 0x22000 for the batchbuffer (remember I said we’re ignoring how graphics addresses work for now, so just play along)
4. i915.ko writes 4 bytes, MI_BATCH_BUFFER_START to hardware ringbuffer.
5. i915.ko writes 4 bytes, 0x22000 to hardware ringbuffer.
6. i915.ko increments the tail pointer by command length (8). TAIL := 0x38
7. RS0->RS1: DMA fetches TAIL-HEAD bytes. (0x38-0x30) = 8B
8. RS1->RS2: DMA completes. Parsing will find that the command is MI_BATCH_BUFFER_START, and it needs 1 extra dword to proceed. This 8B command is then ready to move on.
9. RS2->RS3: Command was successfully parsed. There is a batchbuffer to be fetched, and once that completes we need to execute it.
10. RS3->RS4: Execution was okay, DMA fetch of the batchbuffer at 0x22000 starts…completes
11. RS4->BS0: ACTHD := 0x22000
12. BS0->BS1: We’re in a batchbuffer. The commands we need to fetch are in our local cache, fetched by the ringbuffer just before so no need to do anything more.
13. BS1->BS2: Parsing of the batchbuffer begins. The first command pointed to by ACTHD is MI_BATCH_BUFFER_END. It is only 4B.
14. BS2->BS3: Parse was successful. Execute the command MI_BATCH_BUFFER_END. ACTHD := 4. There are no extra requirements for this command.
15. BS3->RS0: Batchbuffer told us to end, so we go back to the ring. Increment our HEAD pointer by the size of the last command (8B). Set ACTHD equal to HEAD. HEAD := 0x38. ACTHD := 0x38.
16. HEAD == TAIL… we’re idle.
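The walkthrough above can be replayed with a toy interpreter. It is a rough sketch under stated assumptions rather than a description of the real hardware: graphics memory is modelled as a plain dict of address to dword, the function name `run` and its `max_steps` guard are invented for the example, ring wrap-around is ignored, and the opcode encodings are assumed to be `opcode << 23` as I recall them from the i915 headers.

```python
# Assumed encodings (opcode << 23); treat them as illustrative placeholders.
MI_NOOP = 0x00000000
MI_BATCH_BUFFER_END = 0x0A << 23
MI_BATCH_BUFFER_START = 0x31 << 23


def run(memory, head, tail, max_steps=1000):
    """Execute ring commands in [head, tail); return (head, acthd, trace)."""
    trace = []
    acthd = head
    steps = 0
    while head != tail:                       # leave RS0 once TAIL has moved
        cmd = memory.get(head, MI_NOOP)       # RS1/RS2: fetch and parse
        if cmd == MI_BATCH_BUFFER_START:
            acthd = memory[head + 4]          # RS4 -> BS0: jump into the batch
            trace.append(("enter batch", acthd))
            while True:
                steps += 1
                if steps > max_steps:
                    # Hypothetical stand-in for a hang in the batch.
                    raise RuntimeError("no MI_BATCH_BUFFER_END reached")
                op = memory.get(acthd, MI_NOOP)
                if op == MI_BATCH_BUFFER_END:
                    break                     # BS3 -> RS0: back to the ring
                if op == MI_BATCH_BUFFER_START:
                    acthd = memory[acthd + 4]  # chained batchbuffer
                else:
                    acthd += 4                # any other dword: skip over it
            trace.append(("leave batch", acthd))
            head += 8                         # retire the 8B START command
        else:
            head += 4                         # retire a single-dword command
        acthd = head                          # in the ring, ACTHD tracks HEAD
    return head, acthd, trace


memory = {
    0x30: MI_BATCH_BUFFER_START,   # steps 4-5: written by i915.ko
    0x34: 0x22000,
    0x22000: MI_BATCH_BUFFER_END,  # the gem_exec_nop batchbuffer
    0x22004: MI_NOOP,
}
head, acthd, trace = run(memory, head=0x30, tail=0x38)
```

With the made-up addresses from the walkthrough, the run ends with HEAD and ACTHD both equal to TAIL, i.e. the idle state of step 16. While the batch is executing, ACTHD holds an address inside the batch and HEAD stays in the ring, which is the distinction used to tell a hang in the batch from a hang in the ring.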
# Summary User space builds up a command, and list of buffers. Then the userspace tells the kernel about it via IOCTL. Kernel does some work on the command to find all the buffers and so on, then submits it to the hardware. Some time later, userspace can see the results of the commands (not discussed in detail). On the hardware side, we’ve got a ringbuffer with a head and tail pointer, a way to dispatch commands which are located sparsely in our address space, and a way to get execution back to the ringbuffer. ## SVG links 1. The synchronous nature of the IOCTL is something which has been discussed several times. One usage model which would really like to break that is a GPU scheduler. In the case of a scheduler, we’d want to queue up work and return to userspace as soon as possible; but that work may not yet make it into the hardware. 2. Buffer objects are managed with a reference count. When a buffer is created, it gets a ref count of 1, and the refcount is decremented either when the object is explicitly destroyed, or the application ceases to exist. Therefore, the only way gem_exec_nop can fail during the look up portion of execbuffer, is if the application somehow dies after creating the batchbuffer, but before calling the execbuffer IOCTL. 3. As I showed in the first diagram we consider the command executed to be “in order.” Here this means that commands are executed sequentially, (and hand waving over some caching stuff) the side effects of commands are completed by execution of the later commands. This made the implicit synchronization that is baked in to the GEM API really easy to handle (the API has no ways to explicitly add synchronization objects). To put this another way, if a GPU client submits a command that operates on object X, then a second command also operating on object X; it was guaranteed to execute in that order (as long as there was no race condition in userspace submitting commands). 
However, when you have multiple instances of the in-order command streamers, synchronization is no longer free. If a command is submitted to command streamer1 referencing object X, and then a second command is submitted to command streamer2 also referencing object X… no guarantees are made by hardware about the order of the commands. In this case, synchronization can be achieved in two ways: hardware based semaphores, or by stalling on the second command until that first one completes.
4. Certain commands which may provide security risks are not allowed to be executed by untrusted entities. If the hardware parses such a command from an untrusted entity, it will convert it into an MI_NOOP. Batchbuffers can be executed in a trusted manner, but implementing such a thing is complex.
5. When the CS is executing from the ring, HEAD == ACTHD. Once the CS jumps into the batchbuffer, ACTHD will take on the address within the batchbuffer, while HEAD will remain only relevant to its position in the ring. We use this fact to help us debug whether we hung in the batch, or in the ring.

July 10, 2014

One feature we are spending quite a bit of effort on around the Workstation is container technologies for the desktop. This has been on the wishlist for quite some time and luckily the pieces for it are now coming together. Thanks to strong collaboration between Red Hat and Docker we have a great baseline to start from. One of the core members of the desktop engineering team, Alex Larsson, has been leading the Docker integration effort inside Red Hat and we are now preparing to build onwards on that work, using the desktop container roadmap created by Lennart Poettering. So while Lennart’s LinuxApps ideas predate Docker, they do provide a great set of steps we need to turn Docker into a container solution not just for server and web applications, but also for desktop applications.
And luckily a lot of the features we need for the desktop are also useful for the other usecases, for instance one of the main things Red Hat has been working with our friends at Docker on is integrating systemd with Docker. There are a set of other components as part of this plan too. One of the big ones is Wayland, and I assume that if you are reading this you have already seen my Wayland in Fedora updates. Two other core technologies we identified are kdbus and overlayfs. Alex Larsson has already written an overlayfs backend for Docker, and Fedora Workstation Steering committee member, Josh Bowyer, just announced the availability of a Copr which includes experimental kernels for Fedora with overlayfs and kdbus enabled. In parallel with this, David King has been prototyping a version of Cheese that can be run inside a container and that uses this concept that in the LinuxApps proposal is called ‘Portals’, which is basically dbus APIs for accessing resources outside the container, like the webcam and microphone in the case of Cheese. For those interested he will be presenting on his work at GUADEC at the end of the Month, on Monday the 28th of July. The talk is called ‘Cheese: TNG (less libcheese, more D-Bus)’ So all in all the pieces are really starting to come together now and we expect to have some sessions during both GUADEC and Flock this year to try hammer out the remaining details. If you are interested in learning more or join the effort be sure to check the two conferences notice boards for time and place for the container sessions. There is still a lot of work to do, but I am confident we have the right team assembled to do it. In addition to the people already mentioned we for instance have Allan Day who is kicking off an effort to look at the user experience we want to provide around the container hosted application bundles in terms of upgrades and installation for instance. 
And we will also work with the wider Docker community to make sure we have great composition tools for creating these container images available for developers on Fedora.  July 04, 2014 Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China. My talk was about systemd's present and future, what we achieved and where we are going. In my talk I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system, to more being a set of basic building blocks to build an OS from. Most of the talk I talked about where we still intend to take systemd, which areas we believe should be covered by systemd, and of course, also the always difficult question, on where to draw the line and what clearly is outside of the focus of systemd. The slides of my talk you find online. (No video recording I am aware of, sorry.) The combined conferences were a lot of fun, and as usual, the best discussions I had in the hallway track, discussing Linux and systemd. A number of pictures of the conference are now online. Enjoy! After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing, we tried all kinds of fantastic stuff, from Peking duck, to Bullfrog Sechuan style. Yummy. And one of those days I am sure I will find the time to actually sort my photos and put them online, too. I am really looking forward to the next FUDCON/GNOME.Asia! Update: I had actually managed to disable the VAAPI encoding in 1.2, so I just rolled a 1.3 release which re-enabled it. Apart from that it is identical So I finally managed to put out a new Transmageddon release today. It is primarily a bugfix release, but considering how many critical bugs I ended up fixing for this release I am actually a bit embarassed about my earlier 1.x releases. 
There was, for instance, some stupidity in my code that triggered thread safety issues, which I know hit some of my users quite badly. But there were other things not working properly either, like dropping the video stream from a file. Anyway, I know some people think that filing bugs doesn’t help, but I think I fixed every reported Transmageddon bug with this release (although not every feature request bugzilla item). So if you have issues with Transmageddon 1.2 please let me know and I will try my best to fix them. I do try to keep a policy that it is better to have limited functionality where what is there is solid, as opposed to having a lot of features that are unreliable or outright broken.

That said, I couldn’t help myself, so there are a few new features in this release. First of all, if you have the GStreamer VAAPI plugins installed (and be sure to have the driver too) then the VAAPI GPU encoder will be used for h264 and MPEG2. Secondly, I brought back the so called ‘xvid’ codec (even though xvid isn’t really a separate codec, but a name used to refer to the MPEG4 Video codec using the advanced-simple profile).

So as the screenshot below shows, there are not a lot of UI changes since the last version, just some smaller layout and string fixes, but stability is hopefully greatly improved.

I am currently looking at a range of things as the next feature for Transmageddon including:
• Batch transcoding, allowing you to create a series of transcoding jobs upfront instead of doing the transcodes one by one
• Advanced settings panel, allowing you to choose which encoders to use for a given format, what profiles to use, turn deinterlacing on/off and so on
• Profile generator, create new device profiles by inspecting existing files
• Redo the UI to switch away from deprecated widgets

If you have any preference for which I should tackle first feel free to let me know in the comments and I will try to let popular will decide what I do first.

P.S.
I would love to have a high contrast icon for Transmageddon (HighContrast App icon guidelines) – So if there is any graphics artists out there willing to create one for me I would be duly greatful  July 03, 2014 As we are approaching Fedora Workstation 21 we held a meeting to review our Wayland efforts for Fedora Workstation inside Red Hat recently. Switching to a new core technology like Wayland is a major undertaking and there are always big and small surprises that comes along the way. So the summary is that while we expect to have a version of Wayland in Fedora Workstation 21 that will be able to run a fully functional desktop, there are some missing pieces we now know that will not make it. Which means that since we want to ship at least one Fedora release with a feature complete Wayland as an option before making it default, that means that Fedora Workstation 23 is the earliest Wayland can be the default. Anyway, here is what you can expect from Wayland in Fedora 21. • Wayland session available in GDM (already complete and fully working) • XWayland working, but without accelerated 3D (done, adding accelerated 3D will be done before FW 22) • Wayland session working with all free drivers (Currently only Intel working, but we expect to have NVidia and AMD support enabled before F21) • IBUS input working. (Using the IBUS X client. Wayland native IBUS should be ready for FW22.) • Touchpad acceleration working. (Last missing piece for a truly usable Wayland session, lots of work around libinput and friends currently to have it ready for F21). • Wacom tablets will not be ready for F21 • 3D games should work using the Wayland backend for SDL2. SDL1 games will need to wait for FW22 so they can use the accelerated XWayland support). • Binary driver support from NVidia and AMD very unlikely to be ready for F21. • Touch screen support working under Wayland. 
We hope to have F21 testbuilds available soon that the wider community can use to help us test, because even when all the big ‘checkboxes’ are filled in there will of course be a host of smaller issues and outright bugs that need ironing out before Wayland is ready to replace X completely. We really hope the community will get involved with testing Wayland so that we can iron out all major bugs before F21.

How to get involved with the Fedora Workstation effort
To help more people get involved we recently put up a tasklist for the Fedora Workstation. It is a work in progress, but we hope that it will help more people get involved and help move the project forward.

June 26, 2014

Hi folks,

Following up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and read back results from mesa. At the end of this project, almost all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

## ioctl calls vs software methods
An ioctl (Input/Output control) is the most common hardware-controlling operation; it’s a sort of system call, available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card is processing the command stream (FIFO) and encounters an unimplemented method. Then PFIFO waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this time, the kernel is inside an interrupt context; the driver then determines the method and object that caused the interrupt and implements the method.
The main difference between these two approaches is that software methods can be easily synchronized with the CPU through the command stream and are context-dependent, while ioctls are unsynchronized with the command stream. With SW methods, we can make sure it is called right after the commands we want and the following commands won’t get executed until the sw method is handled by the CPU; this is not possible with an ioctl.

Currently, I have a first prototype of that interface using a set of software methods to get the advantage of the synchronization along the command stream, but also because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context, we have to limit as much as possible the number of instructions needed to complete the task processed by it, and it’s absolutely forbidden to do a sleep call, for example.

## A first prototype using software methods
Basically that interface, like the NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace and synchronize the different queries which will be sent by the userspace to the kernel. All of these operations are sent through a set of software methods.

#### Configure a counter
To configure a counter we will use a software method which is still not currently defined, but since we can send 32 bits of data along with it, it’s sufficient to identify a counter. For this, we can either send the global ID of the counter, or allocate an object which represents a counter from the userspace and send its handle with that sw method. Then, the kernel pushes that counter in a staging area waiting for the next batch of counters or for the sample command. This command can be invoked successively to add several counters. Once all counters added by the user are known by the kernel, it’s time to send the sample command.
It’s also possible to synchronize the configuration with the beginning and the end of a frame using software methods.

#### Sample a counter
This command also uses a software method which just tells the kernel to start monitoring. At this time, the kernel is configuring counters (i.e. writing values to a set of special registers), reading and storing their values, including the number of cycles processed, which may be used by the userspace to compute a ratio.

#### Expose counter’s data to the userspace
Currently, we can configure and sample a counter but the result of this counting period is not yet exposed to the userspace. Basically, to be able to send results from the kernel to mesa we use a notifier buffer object which is dedicated to the communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along a channel, so it can be accessible both by the kernel and the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel, then we just have to map it. At this time, the kernel can write results to this BO, and the userspace can read back from it. The result of a counting period is copied by the kernel to this notifier BO from another software method which is also used in order to synchronize queries.

#### Synchronize queries with a sequence number
To synchronize queries we use a different sequence ID (like a fence) for each query we send to the kernel space. When the user wants to read out a result, it sends a query ID through a software method. Then this method does the read out, copies the counter’s value to the notifier BO and writes the sequence number at offset 0. Also, we use a ringbuffer in the notifier BO to store the list of counter IDs, cycles and counter values. This ringbuffer is a nice way to avoid stalling the command submission and is a good fit for the gallium HUD which queues up to 8 frames before having to read back the counters.
As for the HUD, this ringbuffer stores the result of the N previous readouts. Since offset 0 stores the latest sequence ID, we can easily check if the result is available in the ringbuffer. To check the result, we can busy-wait until the query we want is available in the ringbuffer, or we can check whether the result of that query has been overwritten by a newer one.

To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events like special counter modes and multiple passes, I still had to improve it.

Currently, the connection between these software methods and perfmon is a work in progress. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my github account, you can take a look at them here. I also have an out-of-mesa example, initially written by Martin Peres, which shows how to use that first prototype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating. I will get back to you on them when I’m done experimenting.

Designing and implementing a kernel interface in an elegant way takes a while…

See you soon for the full implementation!

June 25, 2014

Firewalls

Fedora has had problems for a long while with the default firewall rules. They would make a lot of things not work (media and file sharing of various sorts, usually, whether as a client or a server) and users would usually disable the firewall altogether, or work around it through micro-management of opened ports.

We went through multiple discussions over the years trying to break the security folks' resolve on what should be allowed to be exposed on the local network (sometimes trying to get rid of the firewall).
Or rather we tried to agree on a setup that would be implementable for desktop developers and usable for users, while still providing the amount of security and dependability that the security folks wanted.

The last round of discussions was more productive, and I posted the end plan on the Fedora Desktop mailing-list.

By Fedora 21, Fedora will have a firewall that's completely open for the user's applications (with better tracking of what applications do what once we have application sandboxing). This reflects how the firewall was used on the systems that the Fedora Workstation version targets. System services will still be blocked by default, except a select few such as ssh or mDNS, which might need some tightening.

But this change means that you'd be sharing your music through DLNA on the café's Wi-Fi, right? Well, this is what this next change is here to avoid.

Per-network Sharing

To avoid showing your music in the café, or exposing your holiday photographs at work, we needed a way to restrict sharing to wireless networks where you'd already shared this data, and provide a way to avoid sharing in the future, should you change your mind.

Allan Day mocked up such controls in our Sharing panel which I diligently implemented. Personal File Sharing (through gnome-user-share and WebDAV), Media Sharing (through rygel and DLNA) and Screen Sharing (through vino and VNC) implement the same per-network sharing mechanism.

Make sure that your versions of gnome-settings-daemon (which implements the starting/stopping of services based on the network) and gnome-control-center match for this all to work. You'll also need the latest version of all 3 of the aforementioned sharing utilities.

(and it also works with wired network profiles :)

Lately at Collabora I have been working on helping Mozilla with the GTK+ 3 port of Firefox.

## The problem

The issue we had to solve is that GTK+ 2 and GTK+ 3 cannot be loaded in the same address space.
Moving Firefox from GTK+ 2 to GTK+ 3 isn’t a problem, as only GTK+ 3 gets loaded in its address space, and everything is fine. The problem comes when you load a plugin that links to GTK+ 2, e.g. Flash. Then, GTK+ 2 and GTK+ 3 both get loaded, GTK+ detects that, and aborts to avoid bigger problems. This was tracked as bug #624422.

More specifically, Firefox links to libxul.so, which in turn links to GTK+. These days, the plugins are loaded in a separate process, plugin-container, which communicates with the Firefox process through IPC. If plugin-container didn’t link to GTK+, there would be absolutely no problem, as the browser (Firefox) process could link to GTK+ 3 and plugin-container could load any plugin, including GTK+ 2 ones. However, although plugin-container doesn’t directly use GTK+, it links to libxul.so for IPC, which brings GTK+ into its address space.

## The solution

In order to solve this, we evaluated various options. The first one was to split libxul.so in two parts, one with the IPC code and lower level stuff, which wouldn’t link to GTK+, and another with the rest of the code, including all the widget and toolkit integration, which would obviously link to GTK+. However this turned out not to be possible as the libxul code was too intricate.

In the end, we decided to add a thin layer between libxul and GTK+, which we called libmozgtk.so. This small layer links to GTK+ 3, and provides stubs for GTK+ 2 specific symbols. Additionally, there is a libmozgtk2.so with SONAME “libmozgtk.so”, which links to GTK+ 2 and provides stubs for GTK+ 3 symbols. We made libxul link against libmozgtk.so, and so when Firefox runs, libxul.so, libmozgtk.so, and GTK+ 3 are loaded, and Firefox uses GTK+ 3. However when plugin-container is executed, we add LD_PRELOAD=libmozgtk2.so in the environment. Since libmozgtk2.so has a libmozgtk.so SONAME, the libxul.so dependency is satisfied, and the plugin-container process ends up with GTK+ 2.
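The SONAME trick can be illustrated with a toy model of the dynamic loader. This is a deliberately simplified sketch, not how ld.so is implemented: the `Lib` class, the `load` function and the depth-first loading order are my own assumptions, and the only behaviour it tries to capture is that a dependency is considered satisfied when an already loaded library carries the requested SONAME.

```python
from dataclasses import dataclass


@dataclass
class Lib:
    filename: str
    soname: str
    needed: tuple = ()   # SONAMEs this library depends on


# Hypothetical on-disk libraries, keyed by file name.
DISK = {
    "libxul.so": Lib("libxul.so", "libxul.so", ("libmozgtk.so",)),
    "libmozgtk.so": Lib("libmozgtk.so", "libmozgtk.so", ("libgtk-3.so.0",)),
    "libmozgtk2.so": Lib("libmozgtk2.so", "libmozgtk.so", ("libgtk-x11-2.0.so.0",)),
    "libgtk-3.so.0": Lib("libgtk-3.so.0", "libgtk-3.so.0"),
    "libgtk-x11-2.0.so.0": Lib("libgtk-x11-2.0.so.0", "libgtk-x11-2.0.so.0"),
}


def load(needed, preload=()):
    """Return the file names loaded for a process, preloaded libraries first."""
    loaded = []

    def load_one(name):
        # A request is satisfied if a loaded library has that file name or SONAME.
        if any(name in (lib.filename, lib.soname) for lib in loaded):
            return
        lib = DISK[name]
        loaded.append(lib)
        for dep in lib.needed:
            load_one(dep)

    for name in tuple(preload) + tuple(needed):
        load_one(name)
    return [lib.filename for lib in loaded]


firefox = load(["libxul.so"])
plugin_container = load(["libxul.so"], preload=["libmozgtk2.so"])
```

In this model `firefox` ends up with libmozgtk.so and GTK+ 3, while `plugin_container` never loads libmozgtk.so at all, because the preloaded libmozgtk2.so already answers to that SONAME.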
Since plugin-container doesn’t make use of the GTK+ code in libxul, this is safe, and we end up with a GTK+ 3 Firefox that can load GTK+ 2 plugins.

The end result is that you can watch YouTube videos again!

While this solution is somewhat hacky, it means we didn’t need to mess with libxul, splitting it in two just for the Linux/GTK+ port’s sake. And when the GTK+ 2 plugins become irrelevant, or NPAPI support is removed (as recently happened in Chrome), we should be able to easily revert this and use GTK+ 3 everywhere.

## Wayland

On an unrelated note, we have looked a bit at porting Firefox to Wayland. Wayland is designed to be a replacement for X11, and is becoming very popular in the digital TV and set top box space. Those obviously need HTML engines and web browsers, and with WebKit and Chrome already having Wayland ports, we think Firefox shouldn’t fall behind.

For this, the GTK+ 3 port was a prerequisite, but that isn’t enough. There are many X11 uses in the Firefox codebase, most of which are guarded by #ifdef MOZ_X11, though not all of them are. We got Firefox to start on Weston (the Wayland reference compositor) with a bunch of hacks, one of which broke keyboard input (but avoided a segfault). As you can see from the screenshot, things aren’t perfect, but it’s at least a good start!

June 23, 2014

This will, I think, be the first time blogging about something quite so retroactively, but for reasons which should be apparent, I could not blog about this little adventure until now. This is the story of CVE-2014-0972 (QCIR-2014-00004-1), and (at least part of) how I was able to install fedora on my firetv:

#### Introduction..

Back in April, I bought myself a Fire TV, with the thought that it would make a nice fedora xbmc htpc setup, complete with open src drivers, to replace my aging pandaboard. But, of course, as delivered the Fire TV is locked down with no root access.
At the same time, there was a feature of the downstream android kernel gpu driver (kgsl), per-context pagetables, which had been on my TODO list for the upstream drm/msm driver for a while now. But, I needed to understand better what kgsl was doing and the interactions with the hardware, in particular the behaviour of the CP (command processor), in order to convince myself that such a feature was safe. People generally frown on introducing root holes in the upstream kernel, and I didn't exactly have documentation about the hardware. So it was time to roll up my sleeves and get some hands-on experience (translation: try to poke and crash the gpu in lots of different ways and try to make sense of the result).

#### Into the rabbit hole..

The modern snapdragon SoCs use IOMMUs everywhere, including the GPU. To implement per-context gpu pagetables, basically all the driver needs to do is to bang a few IOMMU registers to change the pagetable base addr and invalidate the TLB. But this must be done when you are sure the GPU is not still trying to access memory mapped in the old page tables. Since a GPU is a highly asynchronous device, it would be a big performance hit to stall until the GPU ringbuffer drains, then reprogram the IOMMU, then resume the GPU with commands from the new context. To avoid this performance hit, kgsl maps some of the IOMMU registers into the GPU's virtual address space, and emits commands into the ringbuffer for the CP to write the necessary registers to switch pagetables and invalidate the TLB. It was this reprogramming of the IOMMU from the GPU itself which I needed to understand better.

Anyone who understands GPUs would have the initial reaction that this is extremely dangerous. But kgsl was, it seemed, taking some protections. However, I needed to be sure I properly understood how this worked, to see if there was something that was overlooked.

The GPU, in fact, has two hw contexts which it can switch between.
Essentially it is in some ways similar to supervisor vs user context on a CPU. The way kgsl uses this is to map the IOMMU registers into the supervisor context, but not user contexts. The ringbuffer is mapped into all the user contexts, plus supervisor context, at the same device virtual address. The idea being that if the ringbuffer is mapped in the same position in all contexts, you can safely context switch from commands in the ringbuffer. To do this, kgsl emits commands for the CP to write a special bit in CP_STATE_DEBUG_INDEX to switch to the "supervisor" context. Then commands to write IOMMU registers, followed by a write to CP_STATE_DEBUG_INDEX to switch back to user context. (I'm over-simplifying slightly, as there are some barriers needed to account for asynchronous writes.)

But userspace constructed commands never execute from the ringbuffer, instead the kernel puts an IB (indirect branch) into the ringbuffer to jump to the userspace constructed cmdstream buffer. This userspace cmdstream buffer is never mapped into supervisor context, or into other users' contexts. So in theory, if userspace tried to write CP_STATE_DEBUG_INDEX to switch to supervisor mode (and gain access to the IOMMU registers), the GPU would immediately page fault, since the cmdstream it was in the middle of executing is no longer mapped.

Ok, so far, so good.

#### Where it breaks down..

From my attempts at switching to supervisor mode from IB1, and deciphering the fault address where the gpu crashed, and iommu register dumps, I could tell that the next few commands after the switch to supervisor mode were executed without problem.. there is some prefetch/pipelining!

But much more conveniently, while poking around, I realized that there were a couple of pages mapped globally (in supervisor and all user contexts), which were mapped writable in user contexts. I used the so called "setstate" buffer.
So I simply had to construct a cmdstream buffer to write the commands I wanted to execute into the setstate buffer, and then do an IB to that buffer and do the supervisor switch in IB2.

Ok.. but to do anything useful with this, I'd need a reasonable chunk of physically contiguous pages, at a known physical address.. in particular 16K for first level pagetables and 16K for second level pagetables. Fortunately ION comes to the rescue here, with its physically contiguous carveouts at known physical addresses. In this case, allocate from the multimedia pool when there is no video playback, etc, going on. This way ION allocates from the beginning of the carveout pool, a known address. Into this buffer, construct a new set of pagetables, which map whatever physical address you want to read/write (hint, any of kernel lowmem), and a replacement page for the setstate buffer (since we don't know the original setstate buffer's physical address.. which means we actually have two copies of the commands copied into the setstate buffer, one copied via gpu to the original setstate page, and one written directly by cpu in the replacement setstate page).

The proof of concept that I made simply copied the string "Kilroy was here" into a kernel buffer. But quite easily any random app downloaded from an untrusted source could access any memory, become root, etc. Not the sort of thing you want falling into the wrong hands.

Once I managed to prove to myself that I understood properly how the hw was working, I wrote up a short report, and submitted it (plus proof of concept) to the qualcomm security team.

Now that the vulnerability is no longer embargoed, I've made available the proof of concept and report here.

Originally I planned to (once fixes were pushed out, so as to not put someone who did not intend to root their device at risk) release a jailbreak based on this vulnerability. But once towelroot was released, there was no longer a need for me to turn this into an actual firetv jailbreak.
Which saves me from having to figure out how to make an apk.

#### Parting thoughts..

1. Well, knowledge about physical addresses and contiguous memory in userspace, while it might not be a security problem in and of itself, sure helps turn other theoretical exploits into actual exploits.
2. As far as downstream vendor drivers go, the kgsl driver is actually pretty decent, in terms of code quality, etc. I've seen far worse. Admittedly this was not a trivial hole. But imagine what issues lurk in other downstream gpu/camera/video/etc drivers. Security is often not simple, and I really doubt whether the other downstream drivers are getting a critical look (from good-guys who will report the issue responsibly).
3. I used to think of the whole one-kernel-branch-per-device wild-west ways of android as a bit of a headache. Now I realize it is a security nightmare. An important part of platform security is being able to react quickly when (not if) vulnerabilities are found. In the desktop/server world, CVEs are usually not embargoed for more than a week.. that is all you need, since fortunately we don't need a different kernel for each different make and model of server, laptop, etc. In the mobile device world, it is quite a different story!

June 22, 2014

It's been a week now, and I've made surprising amounts of progress on the project.

I came in with this giant task list I'd been jotting down in Workflowy (Thanks for the emphatic recommendation of that, Qiaochu!). Each of the tasks I had was something where I'd have been perfectly unsurprised if it had taken a week or two. Instead, I've knocked out about 5 of them, and by Friday I had phire's "hackdriver" triangle code running on a kernel with a relocations-based GEM interface. Oh, sure, the code's full of XXX comments, insecure, and synchronous, but again, a single triangle rendering in a month would have been OK with me.
I've been incredibly lucky, really -- I think I had reasonable expectations given my knowledge going in. One of the ways I'm lucky is that my new group is extremely helpful. Some of it is things like "oh, just go talk to Dom about how to set up your serial console" (turns out minicom fails hard, use gtkterm instead. Also, someone else will hand you a cable instead of having to order one, and Derek will solder you a connector. Also, we hid your precious dmesg from the console after boot, sorry), but it extends to "Let's go have a chat with Tim about how to get modesetting up and running fast." (We came up with a plan that involves understanding what the firmware does with the code I had written already, and basically whacking a register beyond that. More importantly, they handed me a git tree full of sample code for doing real modesetting, whenever I'm ready.).

But I'm also lucky that there's been this community of outsiders reverse engineering the hardware. It meant that I had this sample "hackdriver" code for drawing a triangle with the hardware entirely from userspace, that I could incrementally modify to sit on top of more and more kernel code. Each step of the way I got to just debug that one step to go from "does not render a triangle" back to "renders that one triangle." (Note: When a bug in your command validator results in pointing the framebuffer at physical address 0 and storing the clear color to it, the computer will go away and stop talking to you. Related note: When a bug in your command validator results in reading your triangle from physical address 0, you don't get a triangle. It's like I need a command validator for my command validator.).

https://github.com/anholt/linux/tree/vc4 is the code I've published so far. Starting Thursday night I've been hacking together the gallium driver.
I haven't put it up yet because 1) it doesn't even initialize, but more importantly 2) I've been using freedreno as my main reference, and I need to update copyrights instead of just having my boilerplate at the top of everything. But next week I hope to be incrementally deleting parts of hackdriver's triangle code and replacing it with actual driver code.

June 20, 2014

NVIDIA NVPerfKit is a suite of performance tools to help developers identify the performance bottlenecks of OpenGL and Direct3D applications. It allows you to monitor hardware performance counters, which are used to store the counts of hardware-related activities from the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to identify bottlenecks in their applications, like “how busy is the gpu?” or “how many triangles have been drawn in the current frame?” and so on. But NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counters to help Linux/Nouveau developers improve their OpenGL applications. At the end of this summer, this project aims to offer a Linux version of NVPerfKit for NVIDIA’s graphics cards (only GeForce 8, 9 and 2XX at first). To expose these hardware events to the userspace, we have to write an interface between the Linux kernel and mesa. Basically, the idea is to tell the kernel to monitor signal X and read back the results from the userspace (i.e. mesa). However, before writing that interface we have to study the behaviour of NVPerfKit on Windows.

First, let me explain (again) what a hardware performance counter really is. A hardware performance counter is a set of special registers used to count hardware-related activities. There are two types of counters: global counters from PCOUNTER and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters.
PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters whereas MP counters are per-app and context switched. Actually, these two types of counters are not really independent and may share some configuration parts, for example the output of a signal multiplexer. On Tesla/nv50, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we are only focusing on global counters. Now, the question is: how does NVPerfKit monitor these global performance counters?

Case #1: How does NVPerfKit handle multiple apps being monitored concurrently?

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more than one application is monitoring performance counters at the same time. Because of the issue of shared configuration between global counters (PCOUNTER) and local counters (MP counters), I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, using a global lock to allow only one application at a time and to simplify the implementation.

Case #2: How does NVPerfKit handle only one counter per domain?

This is the simplest case, and there are no particular requirements.

Case #3: How does NVPerfKit handle multiple counters per domain?

NVPerfKit uses a round robin mode: it still monitors only one counter per domain and it switches the current counter after each frame.

Case #4: How does NVPerfKit handle multiple counters on different domains?

No problem here, NVPerfKit is able to monitor multiple counters on different domains (each domain having up to one event to monitor).

To sum up, NVPerfKit always uses a round robin mode when it has to monitor more than one hw event on the same domain.
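The round robin behaviour observed in cases #2 to #4 can be summed up with a small Python sketch. This only models the scheduling described above, under my own assumptions: the function name, the (domain, event) tuples and the event names are hypothetical, and nothing here reflects NVPerfKit's or the kernel's actual code.

```python
from collections import defaultdict


def round_robin_schedule(requested, frames):
    """For each frame, return which event is monitored on each domain.

    `requested` is a list of (domain, event) pairs. Each domain monitors
    exactly one event per frame and moves to its next event on the next frame.
    """
    per_domain = defaultdict(list)
    for domain, event in requested:
        per_domain[domain].append(event)

    schedule = []
    for frame in range(frames):
        active = {
            domain: events[frame % len(events)]
            for domain, events in per_domain.items()
        }
        schedule.append(active)
    return schedule


# Two events on domain 0 (they alternate), one event on domain 1 (always active).
schedule = round_robin_schedule(
    [(0, "event_a"), (0, "event_b"), (1, "event_c")], frames=3
)
```

With this input, domain 1 reports `event_c` on every frame, while domain 0 alternates between `event_a` and `event_b`, so each of those two events is only sampled every other frame.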
Concerning the sampling part, NVIDIA says (NVPerfKit User Guide – page 11 – Appendix B. Counters reference):

“All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.”

This article should have been published last month, but during this time I worked on the prototype’s definition and its implementation. Currently, I have a first prototype which works quite well, and I’ll submit it next week.

See you next week!

June 18, 2014

Bartholomea annulata | (c) Kevin Bryant

It is time for a new Tanglu update, which has been overdue for a long time now! Many things happened in Tanglu development, so here is just a short overview of what was done in the past months.

## Infrastructure

### Debile

The whole Tanglu distribution is now built with Debile, replacing Jenkins, which was difficult to use for package building purposes (although Jenkins is great for other things). You can see the Tanglu builders in action at buildd.tg.o.

The migration to Debile took a lot of time (a lot more than expected), and blocked the Bartholomea development at the beginning, but now it is working smoothly. Many thanks to all the people who have been involved with making Debile work for Tanglu, especially Jon Severinsson. And of course many thanks to the Debile developers for helping with the integration, Sylvestre Ledru and of course Paul Tagliamonte.

### Archive Server Migration

Those who read the tanglu-announce mailinglist know this already: we moved the main archive server stuff at archive.tg.o to a new location, and to a very powerful machine. We also added some additional security measures to it, to prevent attacks.
The previous machine is now being used for the bugtracker at bugs.tg.o and for some other things, including an archive mirror and the new Tanglu User Forums. See more about that below.

## Transitions

There is huge ongoing work on package transitions. Take a look at our transition tracker and the staging migration log to get a taste of it.

Merging with Debian Unstable is also going on right now, and we are working on merging some of the Tanglu changes which are useful for Debian as well (or which just reduce the diff to Tanglu) back to their upstream packages.

## Installer

Work on the Tanglu Live-Installer, although badly needed, has not yet been started (it’s a task ready to be taken by anyone who would like to do it!). However, some awesome progress has been made in making the Debian-Installer work for Tanglu, which allows us to perform minimal installations of the Tanglu base system and allows easier support of alternative Tanglu flavours. The work on d-i also uncovered a bug which appeared with the latest version of findutils, which has been reported upstream before Debian could run into it. This awesome progress was possible thanks to the work of Philip Muškovac and Thomas Funk (in really hard debug sessions).

## Tanglu Forums

We finally have the long-awaited Tanglu user forums ready! As discussed in the last meeting, a popular demand on IRC and our mailing lists was a forum or Stackexchange-like service for users to communicate, since many people can work better with that than with mailinglists. Therefore, the new English TangluUsers forum is now ready at TangluUsers.org. The forum software is in an alpha version though, so we might experience some bugs which haven’t been uncovered in the testing period. We will watch how the software performs and then decide if we stick to it or maybe switch to another one. But so far, we are really happy with the Misago Forums, and our usage of it already led to the inclusion of some patches against Misago.
It is also actively maintained and has an active community.

## Misc Things

### KDE

We will ship with at least KDE Applications 4.13, maybe some 4.14 things as well (if we are lucky, since Tanglu will likely be in feature-freeze when this stuff is released). The other KDE parts will remain on their latest version from the 4.x series. For Tanglu 3, we might update KDE SC 4.x to KDE Frameworks 5 and use Plasma 5 though.

### GNOME

Due to the lack of manpower on the GNOME flavor, GNOME will ship in the same version available in Debian Sid, maybe with some stuff pulled from Experimental, where it makes sense. A GNOME flavor is planned to be available.

### Common infrastructure

We currently run with systemd 208, but a switch to 210 is planned. Tanglu 2 also targets the X.org server in version 1.16. For more changes, stay tuned. The kernel release for Bartholomea is also not yet decided.

### Artwork

Work on the default Tanglu 2 design has started as well – any artwork submissions are most welcome!

## Tanglu joins the OIN

The Tanglu project is now a proud member (licensee) of the Open Invention Network (OIN), which builds a pool of defensive patents to protect the Linux ecosystem from companies who are trying to use patents against Linux. Although the Tanglu community does not fully support the generally positive stance the OIN has about software patents, the OIN effort is very useful and we agree with its goal. Therefore, Tanglu joined the OIN as licensee.

And that’s the stuff for now! If you have further questions, just join us on #tanglu or #tanglu-devel on Freenode, or write to our newly created forum! You can, as always, also subscribe to our mailinglists to get in touch.

June 17, 2014

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe to that if you are interested in more frequent technical updates on what we are working on.)
In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking benefit of the /usr merge that a number of distributions have completed, we want to bring the runtime behaviour of Linux systems to the next level. With the /usr merge completed, most static vendor-supplied OS data is found exclusively in /usr; only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:

1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as a factory reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all the configuration they need during runtime from vendor packages or protocols like DHCP, or are capable of discovering their parameters automatically from the available hardware or periphery.
3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particularly interesting for containers where the goal is to run thousands of container images from the same OS tree.
However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.

### Concepts

A number of Linux-based operating systems have tried to implement some of the schemes described above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind, and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system, we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
3. Startup without either /etc or /var. (We will call these stateless systems.)

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like the first described mode.
Note that a mode where /etc is flushed, but /var is not, is nothing we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).

#### Problems

Booting up a system without a populated /var is relatively straight-forward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, a lot of software does not. In cases like this it is necessary to ship a couple of additional tmpfiles lines that set up at boot-time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var, there needs to be a way for these directories to be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

• A new tool systemd-sysusers has been added. It introduces a new drop-in directory /usr/lib/sysusers.d/.
Minimal descriptions of necessary system users and groups can be placed there. Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.)

The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative, the latter being how system users are traditionally created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.

To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.

Some OS designs use a static, fixed user/group list stored in /usr as primary database for users/groups, with fixed UID/GID mappings. While this works for specific systems, this cannot cover the general purpose. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made of this space and only UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.

Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.
• We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.

• A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been updated since the last boot. This is implemented based on the mtime timestamp of /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary.

• We added a number of service files that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementioned systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc/ld.so.cache.

• If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.

• We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.

• tmpfiles also gained the ability to copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc.
Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, that always contains the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as this provides administrators with an easy place to diff their own configuration in /etc against to see what local changes are in place.

• We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that combines this into a simple switch reusing the vocabulary introduced earlier.

### What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left.
Most importantly: the majority of Linux packages are simply incompatible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (simply because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

• When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped privileges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.

• When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.

• When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.)

If you are a packager, you can also help on making this all work:

• Ask upstream to implement what we describe above, possibly even preparing a patch for this.
• If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.

• Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.

#### Conclusion

Before we finish, let me stress again why we are doing all this:

1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
2. For embedded machines we want a generic way to reset devices. We also want a way for every single boot to be identical to a factory reset, in a stateless system design.
3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
6.
We want to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all discussions on how /usr was to be organized this was frequently mentioned. A setup like this so far only worked in very specific cases; with this scheme we want to make this work in the general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Let's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.

Yesterday was my first day working at Broadcom. I've taken on a new role as an open source developer there. I'm going to be working on building an MIT-licensed Mesa and kernel DRM driver for the 2708 (aka the 2835), the chip that's in the Raspberry Pi.

It's going to be a long process. What I have to work with to start is basically sample code.
Talking to the engineers who wrote the code drops we've seen released from Broadcom so far, they're happy to tell me about the clever things they did (their IR is pretty cool for the target subset of their architecture they chose, and it makes instruction scheduling and register allocation *really* easy), but I've had universal encouragement so far to throw it all away and start over.

So far, I'm just beginning. I'm still working on getting a useful development environment set up and building my first bits of stub DRM code. There are a lot of open questions still as to how we'll manage the transition from having most of the graphics hardware communication managed by the VPU to having it run on the ARM (since the VPU code is a firmware blob currently, we have to be careful to figure out when it will stomp on various bits of hardware as I incrementally take over things that used to be its job).

I'll have repos up as soon as I have some code that does anything.

# Overview

Pictures are the right way to start.

Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

## The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs. Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set of page tables that is maintained to have mappings identical to those of the global GTT (the picture above).
There is more on this in the Summary section.

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

## Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straightforward. The choice is provided in several GPU commands. If there is no explicit choice, then there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or an Aliasing PPGTT[1]. Several other commands, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.

# Architecture

The names for all the page table data structures in hardware are the same as for IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs too much, and I won’t (and am probably not allowed to) copy the diagrams. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table.
There are several good diagrams in the PRMs[2] which I won’t bother redrawing.

Page Directory Entry

| 31:12 | 11:04 | 03:02 | 01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Physical Page Address 39:32 | Rsvd | Page size (4K/32K) | Valid |

Page Table Entry

| 31:12 | 11 | 10:04 | 03:01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32 | Cacheability Control[2:0] | Valid |

There are some things we can get from this for those too lazy to click on the links to the docs.

1. PPGTT page tables exist in physical memory.
2. PPGTT PTEs have the exact same layout as GGTT PTEs.
3. PDEs don’t have cache attributes (more on this later).
4. There exists support for big pages[3]

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

• A PDE is 4 bytes wide
• A PTE is 4 bytes wide
• A Page table occupies 4k of memory.
• There are 4k/4 entries in a page table.

With all this information, I now present you a slightly more accurate picture.

An object with an aliased PPGTT mapping

## Size

PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)

| 63:32 | 31:0 |
|---|---|
| MBZ | PPGTT Directory Cache Restore [1..32] 16 entries |

DCLV, the right way

| 31 | 30 | ... | 1 | 0 |
|---|---|---|---|---|
| PDE[511:496] enable | PDE[495:480] enable | ... | PDE[31:16] enable | PDE[15:0] enable |

The “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name[4]. A PDE is 4 bytes, there are 64 bytes in a cacheline, so 64/4 = 16 entries per bit.
We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB

## Location

PP_DIR_BASE: Sadly, I cannot find the definition of this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs (yay me). There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.

## Programming

With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

    ...
    ret = intel_ring_begin(ring, 6);
    if (ret)
            return ret;

    intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(2));
    intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
    intel_ring_emit(ring, PP_DIR_DCLV_2G);         // program size
    intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
    intel_ring_emit(ring, get_pd_offset(ppgtt));   // program location
    intel_ring_emit(ring, MI_NOOP);
    intel_ring_advance(ring);
    ...

As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about the commands execution in the GPU’s ringbuffer. If it’s easier just pretend it’s 2 MMIO writes.
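For what it's worth, the size arithmetic and the DCLV layout described above fit in a few lines of C. This is only a sketch built from the numbers given in this post (4-byte entries, 4k page tables, 64-byte cachelines, at most 512 PDEs). The `ppgtt_*` names and the `PPGTT_*` constants are made up for illustration and are not i915 code.

```c
#include <assert.h>
#include <stdint.h>

/* Numbers taken from the text above; the names are hypothetical. */
#define PPGTT_PAGE_SIZE         4096u /* bytes mapped by one PTE */
#define PPGTT_ENTRY_SIZE        4u    /* a PDE and a PTE are both 4 bytes */
#define PPGTT_PTES_PER_PT       (PPGTT_PAGE_SIZE / PPGTT_ENTRY_SIZE)  /* 1024 */
#define PPGTT_MAX_PDES          512u
#define PPGTT_CACHELINE         64u   /* bytes */
#define PPGTT_PDES_PER_DCLV_BIT (PPGTT_CACHELINE / PPGTT_ENTRY_SIZE)  /* 16 */

/* Bytes of address space covered by the first num_pdes PDEs. */
static uint64_t ppgtt_size_bytes(unsigned int num_pdes)
{
	assert(num_pdes <= PPGTT_MAX_PDES);
	return (uint64_t)num_pdes * PPGTT_PTES_PER_PT * PPGTT_PAGE_SIZE;
}

/*
 * DCLV value that enables the first num_pdes PDEs. Bit 0 covers
 * PDE[15:0], bit 31 covers PDE[511:496], so round up to whole cachelines.
 */
static uint32_t ppgtt_dclv_mask(unsigned int num_pdes)
{
	unsigned int bits;

	assert(num_pdes <= PPGTT_MAX_PDES);
	bits = (num_pdes + PPGTT_PDES_PER_DCLV_BIT - 1) / PPGTT_PDES_PER_DCLV_BIT;
	/* Shifting a 32-bit value by 32 is undefined, handle it explicitly. */
	return bits >= 32 ? 0xffffffffu : (1u << bits) - 1u;
}

/* Split a PPGTT address into its two table indices and the page offset. */
static unsigned int ppgtt_pde_index(uint32_t addr)
{
	return addr >> 22;           /* 10 bits of PTE index + 12 bits of offset */
}

static unsigned int ppgtt_pte_index(uint32_t addr)
{
	return (addr >> 12) & 0x3ff; /* 1024 PTEs per page table */
}

static unsigned int ppgtt_page_offset(uint32_t addr)
{
	return addr & 0xfff;         /* 4096-byte pages */
}
```

With all 512 PDEs enabled, `ppgtt_size_bytes()` gives the 2GB worked out above and `ppgtt_dclv_mask()` gives all ones, which is consistent with always programming the full size.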
### Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

    ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
                                              &ppgtt->node, GEN6_PD_SIZE,
                                              GEN6_PD_ALIGN, 0,
                                              0, dev_priv->gtt.base.total,
                                              DRM_MM_TOPDOWN);

2. Allocate the page tables

    for (i = 0; i < ppgtt->num_pd_entries; i++) {
            ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
            if (!ppgtt->pt_pages[i]) {
                    gen6_ppgtt_free(ppgtt);
                    return -ENOMEM;
            }
    }

3. [possibly] IOMMU map the pages

    for (i = 0; i < ppgtt->num_pd_entries; i++) {
            dma_addr_t pt_addr;

            pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
                                   PCI_DMA_BIDIRECTIONAL);
            ...
    }

As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.

### IOMMU

As mentioned in step 3 above, the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the dma address. Deferring to Wikipedia again for the description of an IOMMU, that’s all. It tripped me up the first time I saw it because I hadn’t dealt with this kind of thing before.
Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

## Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register. Similarly in the i915 code:

    ecochk = I915_READ(GAM_ECOCHK);
    if (IS_HASWELL(dev)) {
            ecochk |= ECOCHK_PPGTT_WB_HSW;
    } else {
            ecochk |= ECOCHK_PPGTT_LLC_IVB;
            ecochk &= ~ECOCHK_PPGTT_GFDT_IVB;
    }
    I915_WRITE(GAM_ECOCHK, ecochk);

What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.
## Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable of using a per process address space (or where doing so is undesirable).

Here is a brief description of why, along with all the current callers of the global pin interface.

• Display: Display actually implements its own version of the GGTT. Maintaining the logic to support multiple level page tables was both costly, and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. I expect this to be true, forever.
    • i915_gem_object_pin_to_display_plane(): page flipping
    • intel_setup_overlay(): overlays
• Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a nightmare to synchronize, even if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
    • allocate_ring_buffer()
• HW Contexts: Extremely similar to ringbuffer.
    • intel_alloc_context_page(): Ironlake RC6
    • i915_gem_create_context(): Create the default HW context
    • i915_gem_context_reset(): Re-pin the default HW context
    • do_switch(): Pin the logical context we’re switching to
• Hardware status page: The use of this, prior to execlists, is much like ringbuffers, and contexts. There is a per process status page with execlists.
    • init_status_page()
• Workarounds:
    • init_pipe_control(): Initialize scratch space for workarounds.
    • intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
    • render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
• Other
    • i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
    • i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
    • i915_gem_pin_ioctl(): The DRI1 pin interface.

# GEN8 disambiguation

Off the top of my head, here is a list of some of the changes on GEN8, which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

• PTE size increased to 8 bytes
    • Therefore, 512 entries per table
    • Format mimics the CPU PTEs
• PDEs increased to 8 bytes (remains 512 PDEs per PD)
• Page Directories live in system memory
• GGTT no longer holds the PDEs.
• There are 4 PDPs, and therefore 4 PDs
• PDEs are cached in LLC instead of special cache (I’m guessing)
• New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32-bit addressing.
• PP_DIR_BASE, and PP_DCLV are removed
• Support for 4 level page tables, up to 48-bit virtual address space.
    • PML4[PML4E] -> PDP
    • PDP[PDPE] -> PD
    • PD[PDE] -> PT
    • PT[PTE] -> Memory
• Big pages are now 64k instead of 32k (still not implemented)
• New caching interface via PAT like structure

# Summary

There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about everything mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT and the set of objects mapped into the PPGTT are disjoint[5]. The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

• The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
• Aliasing PPGTT was designed as a drop in performance replacement to the GGTT.
• GEN8 changed a lot of architectural stuff.
• The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.

https://bwidawsk.net/blog/wp-content/uploads/2014/06/appgtt_concept.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/real_ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ggtt_flow.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ppgtt_flow.svg

1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT
3. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority.
4. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches
5. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry(

 June 11, 2014

So over the past few years the drm subsystem gained some very nice documentation. And recently we've started to follow suit with the Intel graphics driver. All the kernel documentation is integrated into one big DocBook and I regularly upload latest HTML builds of the Linux DRM Developer's Guide. This is built from drm-intel-nightly so has slightly fresher documentation (hopefully) than the usual DocBook builds from Linus' main branch which can be found all over the place. If you want to build these yourself simply run

    make htmldocs

For testing we now also have neat documentation for the infrastructure and helper libraries found in intel-gpu-tools. The README in the i-g-t repository has detailed build instructions - gtkdoc is a bit more of a fuss to integrate.

Below the break some more details about documentation requirements relevant for developers.

So from now on I expect reasonable documentation for new, big kernel features and for new additions to the i-g-t library.

For i-g-t the process is simple: Add the gtk-doc comment blocks to all newly added functions, install and build with gtk-doc enabled. Done. If the new library is tricky (for example the pipe CRC support code) a short overview section that references some functions to get people started is useful, but not really required. And with the exception of the still in-flux kernel modesetting helper library i-g-t is fully documented, so there are lots of examples to copy from.

For the kernel this is a bit more involved, mostly since kerneldoc sucks more. But we also only just started with documenting the drm/i915 driver itself.
1. First extract all the code for your new feature into a new file. There's unfortunately no other way to sensibly split up and group the reference documentation with kerneldoc. But at least that will also be a good excuse to review the related interfaces before extracting them.
2. Create reference kerneldoc comments for the functions used as interfaces to the rest of the driver. It's always a bit of a judgement call what to document and what not, since compared to the DRM core, where functions must be explicitly exported to drivers, there's no clean separation between the core parts and subsystems and more mundane platform enabling code. For big and complicated features it's also good practice to have an overview DOC: section somewhere at the beginning of the file.
3. Note that kerneldoc doesn't have support for markdown syntax (or anything else like that) and doesn't do automatic cross-referencing like gtk-doc. So if your documentation absolutely needs a table or a list you have to do it twice unfortunately: Once as a plain code comment and once as a DocBook marked-up table or list. Long-term we want to improve the kerneldoc markup support, but for now we have to deal with what we have.
4. As with all documentation, don't document the details of the implementation - otherwise it will get stale fast because comments are often overlooked when updating code.
5. Integrate the new kerneldoc section into the overall DRM DocBook template. Note that you can't go deeper than a section2 nesting, for otherwise the reference documentation won't be listed and, due to the lack of any autogenerated cross-links, will be inaccessible and useless. Build the html docs to check that your overview summary and reference sections have all been pulled in and that the kerneldoc parser is happy with your comments.
A really nice example of how to do this all is the documentation for the gen7 cmd parser in i915_cmd_parser.c.
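To give an idea of the shape of the comments asked for in steps 2 and 4, here is a minimal sketch with an overview DOC: section plus one reference comment. The function `example_ring_space()` and everything about it is made up for illustration and does not exist in the driver. Only the comment layout (DOC: title, `name - summary`, `@param:` lines, a `Return:` section) follows the usual kerneldoc conventions.

```c
#include <assert.h>
#include <stddef.h>

/**
 * DOC: example ring space helpers
 *
 * Overview text for the whole feature goes here: what problem it solves
 * and which functions a newcomer should look at first. Keep it about the
 * interface, not the implementation.
 */

/**
 * example_ring_space - free bytes left in a circular buffer
 * @head: offset the consumer will read from next, must be below @size
 * @tail: offset the producer will write to next, must be below @size
 * @size: total size of the ring in bytes, must be non-zero
 *
 * One byte is always left unused so that @head == @tail can only mean
 * that the ring is empty.
 *
 * Return: number of bytes that can be written without overtaking @head.
 */
static size_t example_ring_space(size_t head, size_t tail, size_t size)
{
	assert(size > 0 && head < size && tail < size);
	return (head + size - tail - 1) % size;
}
```

Note that the comment describes the contract a caller relies on (valid ranges, what the return value means) and says nothing about how the value is computed, so it stays correct when the body changes.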
 June 10, 2014

#### Introduction

Gobi chipsets are mobile broadband modems developed by Qualcomm, and they are nowadays used by lots of different manufacturers, including Sierra Wireless, ZTE, Huawei and of course Qualcomm themselves.

These devices will usually expose several interfaces in the USB layer, and each interface will then be published to userspace as different ‘ports’ (not the correct name, but I guess easier to understand). Some of the interfaces will give access to serial ports (e.g. ttys) in the modem, which will let users execute standard connection procedures using the AT protocol and a PPP session. The main problem with using a PPP session over a serial port is that it makes it very difficult, if not totally impossible, to handle datarates above 3G, like LTE. So, in addition to these serial ports, Gobi modems also provide access to a control port (speaking the QMI protocol) and a network interface (think of it as a standard ethernet interface). The connection procedure then can be executed purely through QMI (e.g. providing APN, authentication…) and then userspace can use a much more convenient network interface for the real data communication.

For a long time, the only way to use such a QMI+net pair in the Linux kernel was to use the out-of-tree GobiNet drivers provided by Qualcomm or by other manufacturers, along with user-space tools also developed by them (some of them free/open, some of them proprietary). Luckily, a couple of years ago a new qmi_wwan driver was developed by Bjørn Mork and merged into the upstream kernel. This new driver provided access to both the QMI port and the network interface, but was much simpler than the original GobiNet one. The scope was reduced so much that most of the work the GobiNet driver was doing in kernel-space now had to be done by userspace applications. There are now at least 3 different user-space implementations that allow using QMI devices through the qmi_wwan port: ofono, uqmi and, of course, libqmi.

The question, though, still remains. What should I use? The upstream qmi_wwan kernel driver and user-space utilities like libqmi? Or rather, the out-of-tree GobiNet driver and user-space utilities provided by manufacturers? I’m probably totally biased, but I’ll try to compare the two approaches by pointing out their main differences.

Note: you may want to read the ‘Introduction to libqmi’ post I wrote a while ago first.

#### in-tree vs out-of-tree

The qmi_wwan driver is maintained within the upstream Linux kernel (in-tree). This, alone, is a huge advantage compared to GobiNet. Kernel updates may modify the internal interfaces they expose for the different drivers, and being within the same sources as all the other ones, the qmi_wwan driver will also get those updates without further effort. Whenever you install a kernel, you know you’ll have the qmi_wwan driver applicable to that same kernel version ready, so its use is very straightforward. The qmi_wwan driver also contains support for Gobi-based devices from all vendors, so regardless of whether you have a Sierra Wireless modem or a Huawei one (just to name a few), the driver will be able to make your device work as expected in the kernel.

GobiNet is a whole different story. There is not just one GobiNet: each manufacturer keeps its own. If you’re using a Sierra Wireless device you’ll likely want to use the GobiNet driver maintained by them, so that for example, the specific VID/PID pairs are already included in the driver; or going a bit deeper, so that the driver knows which is supposed to be the QMI/WWAN interface number that should be used (different vendors have different USB interface layouts). In addition to the problem of requiring to look for the GobiNet driver most suitable for your device, having the drivers maintained out-of-tree means that they need to provide a single set of sources for a very long set of kernel versions. The sources, therefore, are full of #ifdefs enabling/disabling different code paths depending on the kernel version targeted, so maintaining it gets to be much more complicated than if they just had it in-tree.

Note: Interestingly, we’ve already seen fixes that were first implemented in qmi_wwan ‘ported’ to GobiNet variants.

#### Complexity

The qmi_wwan driver is simple; it will just get a USB interface and split it into a QMI-capable /dev/cdc-wdm port (through the cdc-wdm driver) and a wwan network interface. As the kernel only provides basic transport to and from the device, user-space is left to manage the QMI protocol completely, including service client allocations/releases as well as the whole internal CTL service. Note, though, that this is not a problem; user-space tools like libqmi will do this work nicely.

The GobiNet driver is instead very complex. The driver also exposes a control interface (e.g. /dev/qcqmi) and a network interface, but all the work that is done through the internal CTL service is done at kernel-level. So all client allocations/releases for the different services are actually performed internally, not exposed to user-space. Users will just be able to request client allocations via ioctl() calls, and client releases will be automatically managed within the kernel. In general, it is never advisable to have such a complex driver. As the complexity of a driver increases, so does the likelihood of having errors, and crashes in a driver could affect the whole kernel. Quoting Bjørn, the smaller the device driver is, the more robust the system is.

Note: Some Android devices also support QMI-capable chipsets through GobiNet (everything hidden in the kernel and the RIL). In this case, though, you may see that shared memory can also be used to talk to the QMI device, instead of a /dev/qcqmi port.

#### Device initialization

One of the first tasks that is done while communicating with the Gobi device is to set it up (e.g. decide which link-layer protocol to use in the network interface) and make sure that the modem is ready to talk QMI. In the case of the GobiNet driver, this is all done in kernel-space; while in the case of qmi_wwan everything can be managed in user-space. The libqmi library allows several actions to be performed during device initialization, including the setting of the link-layer protocol to use. There are, for example, models from Sierra Wireless (like the new MC7305) which expose by default one QMI+network interface (#8) configured to use 802.3 (ethernet headers) and another QMI+network interface (#10) configured to use raw IP (no ethernet headers). With libqmi, we can switch the second one to use 802.3, which is what qmi_wwan expects, thus allowing us to use both QMI+net pairs at the same time.

#### Multiple processes talking QMI

One of the problems of qmi_wwan is that only one process is capable of using the control port at a given time. The GobiNet driver, instead, allows multiple processes to concurrently access the device, as each process would get assigned different QMI clients with different client IDs directly from the kernel, hence not interfering with each other. In order to handle this issue, libqmi (since version 1.8) was extended to implement a ‘qmi-proxy’ process which would be the only one accessing the QMI port, but which would allow different processes to communicate with the device concurrently (by sharing and synchronizing the CTL service among the connected peers).

#### User-space libraries

The GobiNet driver is designed to be used along with Qualcomm’s C++ GobiAPI library in user-space. On top of this library, other manufacturers (like Sierra Wireless) provide additional libraries to use specific features of their devices. The GobiAPI library will itself handle all the ioctl() calls required to e.g. allocate new clients, and will also provide a high level API to access the different QMI services and operations in the device.

In the case of the qmi_wwan driver, as already said, there are several implementations which will let you talk QMI with the device. libqmi, which I maintain, is one of them. libqmi provides a GLib-based C library, and therefore it exposes objects and interfaces which provide access to the most used QMI services in any kind of device. The CTL service, the internal one which was managed in the kernel by GobiNet, will be managed internally by libqmi and therefore mostly hidden to the users of the library.

Note: It is not (yet) possible to mix GobiAPI with qmi_wwan and e.g. libqmi with GobiNet. Therefore, it is not (yet) possible to use libqmi or qmicli in e.g. an Android device with a QMI-capable chipset.

#### User-space command line tools

I am not really aware of any general purpose command line tool developed to be used with the GobiNet driver (well, there are firmware loader applications, but those are not general purpose). The lack of command line tools is likely due to the fact that, as QMI clients are released automatically by the GobiNet kernel driver, it is not easy (if at all possible) to leave a QMI client allocated and re-use it over and over by a command line tool which executes an action and exits.

With qmi_wwan, though, as clients are not automatically released, command line tools are much easier to handle. The libqmi project includes a qmicli tool which is able to execute independent QMI requests in each run of the program, even re-using the same QMI client in each of the runs if needed. This is especially important when launching a connection, as the WDS client which executes the “Start Network” command must be kept registered as long as the connection is open, or otherwise the connection will get dropped.

The process of loading new firmware into a QMI-based device is not straightforward. It involves several interactions at QMI-level, plus a QDL based download of the firmware to the device (kind of what gobi_loader does for Gobi 2K). Sadly, there is not yet a way to perform this operation when using qmi_wwan and its user-space tools. If you need to update the firmware of the device, the only choice left is to use the GobiNet driver plus the vendor-provided programs.

#### Support

One of the advantages of the GobiNet driver is that every manufacturer will (should) give direct support for their devices if that kernel driver is used. Actually, there are vendors which will only give support for the hardware if their driver is the one in use. I’m therefore assuming that GobiNet may be a good choice for companies if they want to rely on the vendor-provided support, but likely not for standard users who just happen to have a device of this kind in their systems.

But, even if it is not the official support, you can anyway still get in touch with the libqmi mailing list if you’re experiencing issues with your QMI device; or contact companies or individuals (e.g. me!) which provide commercial support for the qmi_wwan driver and libqmi/qmicli integration needs.


Two months ago, in April ’14, I was in San Francisco to meet with other FOSS developers and discuss current projects. There were several events, including the first GNOME Westcoast Summit and a systemd hackfest at Pantheon. I’ve been working on a lot of stuff lately and it was nice to talk directly to others about it. I wrote in-depth articles (on this blog) for the most interesting stories, but below is a short overview of what I focused on in SF:

• memfd: My most important project currently is memfd. We fixed several bugs and nailed down the API. It was also nice to get feedback from a lot of different projects about interesting use-cases that we didn’t think of initially. As it turns out, file-sealing is something a lot of people can make great use of.
• MiracleCast: For about half a year I worked on the first Open-Source implementation of Miracast. It’s still under development and only working Sink-side, but there are plans to make it work Client-side, too. Miracast allows replacing HDMI cables with wireless solutions. You can connect your monitor, TV or projector via standard wifi to your desktop and use it as a mirror or desktop-extension. The monitor is sink-side and MiracleCast can already provide a full Miracast stack for it. However, for the more interesting Source-side (e.g., a Gnome-Desktop) I had a lot of interesting discussions with Gnome developers about how to integrate it. I have some prototypes running locally, but it will definitely take a lot longer before it works properly. However, the current sink-side implementation has a latency of approx. 50ms and can run 30fps 1080p. This is already pretty impressive and on-par with proprietary solutions.
• kdbus: The new general-purpose IPC mechanism is already fleshed out, but we spent a lot of time fixing races in the code and doing some general code review. It is a very promising project and all of the criticism I’ve heard so far was rubbish. People tend to rant about moving dbus into the kernel, even though kdbus really has nearly nothing to do with dbus, except that it provides an underlying data-bus infrastructure. Seriously, the helpers used for kernel-mode-setting, without including the driver-specific code, are already much bigger than kdbus… and in my opinion, kdbus will make dbus a lot more efficient and appealing to new developers.
• GPU: GPU-switching, offload-GPUs and USB/wifi display-controllers are a few of the many new features in the graphics subsystem. They’re mostly unsupported in any user-space, so we decided to change that. It’s all highly technical and the way it is supposed to work is fairly obvious. Therefore, I will avoid discussing the details here. Let’s just say, on-demand and live GPU-switching is something I’m making possible as part of GSoC this summer.
• User-bus: This topic sounds fairly boring and technical, but it’s not. The underlying question is: What happens if you log in multiple times as the same user on the same system? Currently, a desktop system either rejects multiple logins of the same user or treats them as separate, independent logins. The second approach has the problem that many applications cannot deal with this. Many per-user resources have to be shared (like the home-directory). Firefox, for instance, cannot run multiple times for the same user. However, no-one wants to prevent multiple logins of the same user, as it really is a nice feature. Therefore, we came up with a hybrid approach which basically boils down to a single session shared across all logins of the same user. So if you log in twice, you get the same screen for both logins sharing the same applications. The window-manager can put you on a separate virtual desktop, but the underlying session is basically the same. Now if you do the same across multiple seats, you simply merge both sessions of these seats into a single huge session with the screen mirrored across all assigned monitors. A more in-depth article will follow once the details have been figured out.

A lot of the things I worked on deal with the low-level system and are hardly visible to the average Gnome user. However, without a proper system API, there’s no Gnome and I’m very happy the Gnome Foundation is acknowledging this by sponsoring my trip to SF: Thanks a lot! And hopefully I’ll see you again next year!

For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

• ftruncate(2) to change the file size
• read(2), write(2) and all its derivatives to inspect/modify file contents
• mmap(2) to get a direct memory-mapping
• dup(2) to duplicate file-descriptors
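As a rough sketch of how these operations fit together, the helper below creates a memfd, sizes it, writes into it and reads the data back through a mapping. The names `memfd_create_compat()` and `memfd_roundtrip()` are made up for this example, and the raw syscall(2) wrapper is an assumption for C libraries that do not ship their own memfd_create() wrapper.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; the C library may not provide memfd_create() itself. */
static int memfd_create_compat(const char *name, unsigned int flags)
{
	return (int)syscall(SYS_memfd_create, name, flags);
}

/*
 * Create an anonymous file holding a copy of msg (including the
 * terminating NUL), then read it back through a memory mapping.
 * Returns 0 if the mapped contents match, -1 on any failure.
 */
int memfd_roundtrip(const char *msg)
{
	size_t len = strlen(msg) + 1;
	int fd, ret = -1;
	void *map;

	fd = memfd_create_compat("roundtrip-demo", 0);
	if (fd < 0)
		return -1;

	/* A fresh memfd is empty; give it a size before using it. */
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;
	if (write(fd, msg, len) != (ssize_t)len)
		goto out;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	ret = memcmp(map, msg, len) == 0 ? 0 : -1;
	munmap(map, len);
out:
	close(fd);
	return ret;
}
```

Note that the name passed in is never looked up anywhere, so calling the helper repeatedly with the same name does not clash.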

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is why the hell do we need yet another way?

Two crucial differences are:

• memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
• There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.

To be honest, the code required for memfd_create is 100 lines. It didn’t take us 2 months to write these, but instead we added one more feature to memfd_create called Sealing:

## File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32bit integer containing the bitmask of currently set seals on fd. Note that seals are per file, not per file-descriptor (nor per file-description). That means, any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.

The current set of supported seals is:

• F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set and thus do not have to be enforced.
• F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
• F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
• F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.
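To make the fcntl(2) calls above concrete, here is a small sketch that seals a memfd against writes and then checks that a further write(2) is rejected. The function name `seal_then_write()` is invented for the example, and the fallback `#define`s are an assumption for systems whose headers predate sealing; they mirror the values in the Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for C libraries whose headers predate memfd and sealing. */
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_GET_SEALS   (1024 + 10)
#define F_SEAL_SEAL   0x0001
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

/*
 * Write a few bytes into a fresh memfd, seal it against writes, then
 * try to write again. Returns the errno of the rejected write (EPERM
 * is expected), 0 if the write unexpectedly succeeded, or -1 if the
 * setup failed. The seal bitmask read back is stored in *seals_out.
 */
int seal_then_write(int *seals_out)
{
	int fd, ret = -1;

	fd = (int)syscall(SYS_memfd_create, "seal-demo", MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (write(fd, "data", 4) != 4)
		goto out;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		goto out;

	*seals_out = fcntl(fd, F_GET_SEALS);
	if (*seals_out < 0)
		goto out;

	if (write(fd, "more", 4) < 0)
		ret = errno;
	else
		ret = 0;
out:
	close(fd);
	return ret;
}
```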

Instead of discussing the behavior of each seal on its own, the following list shows some examples of how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn’t cover here are still accounted for and the syscalls will fail if they violate any seals.

• IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing all kinds of failures. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline. No attacker can alter file contents, anymore. Furthermore, this also allows safe multicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you’re the last user of an object. However, that was dropped due to horrible races in the kernel. It might reoccur in the future, though.
• Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe using memfd_create) to the server. Similar to the previous scenario, the server cannot safely mmap(2) that object for read-access, as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.
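The receiver-side check from the IPC scenario could look roughly like the sketch below. Both function names are hypothetical: `make_message_fd()` only stands in for the untrusted sender, and `map_if_sealed()` is the part a receiver would run on a descriptor it was handed. A private mapping is used on the assumption that the receiver demands F_SEAL_WRITE, in which case the contents can no longer change; a server that only requires F_SEAL_SHRINK and wants to see later frames would need a shared mapping instead.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for C libraries whose headers predate memfd and sealing. */
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_GET_SEALS   (1024 + 10)
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_WRITE  0x0008
#endif

/* Sender (demo only): memfd holding msg, with `seals` applied if non-zero. */
int make_message_fd(const char *msg, int seals)
{
	size_t len = strlen(msg) + 1;
	int fd = (int)syscall(SYS_memfd_create, "ipc-demo", MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (write(fd, msg, len) != (ssize_t)len ||
	    (seals && fcntl(fd, F_ADD_SEALS, seals) < 0)) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Receiver: accept fd only if it carries every seal in `required`,
 * then map it read-only. Returns the mapping and stores its length in
 * *len_out, or returns NULL if the object is rejected.
 */
const void *map_if_sealed(int fd, int required, size_t *len_out)
{
	struct stat st;
	int seals = fcntl(fd, F_GET_SEALS);
	void *map;

	if (seals < 0 || (seals & required) != required)
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size <= 0)
		return NULL;

	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return map;
}
```

Checking the seals before mapping matters: once the required seals are present they cannot be removed, so the size read by fstat(2) stays valid for the lifetime of the mapping.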

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

## Current Status

As of June 2014 the patches for memfd_create and sealing have been publicly available for at least 2 months and are being considered for merging. linux-3.16 will probably not include them, but linux-3.17 very likely will. Currently, there are still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we’re good to go.

Linus decided to have a bit of fun with the 3.16 merge window and the 3.15 release, so I'm a bit late with our regular look at the new stuff for the Intel graphics driver.
First things first, Baytrail/Valleyview has finally gained support for MIPI DSI panels! Which means no more ugly hacks to get machines like the ASUS T100 going for users and no more promises we can't keep from developers - it landed for real this time around. Baytrail has also seen a lot of polish work in e.g. the infoframe handling, power domain reset, ...

Continuing on the new hardware platform, this release features the first version of our preliminary support for Cherryview. At a very high level this combines a Gen8 render unit derived from Broadwell with a beefed-up Valleyview display block. So a lot of the enabling work boiled down to wiring up existing code, but of course there's also tons of new code to get all the details right. Most of the work has been done by Ville and Chon Ming Lee with lots of help from other people.

Our modeset code has also seen lots of improvements. The user-visible feature is surely support for large cursors. On high-dpi panels 64x64 simply doesn't cut it and the kernel (and latest SNA DDX) now support up to the hardware limit of 256x256. But there's also been a lot of improvements under the hood: More of Ville's infrastructure for atomic pageflips has been merged - slowly all the required pieces like unified plane updates for modeset, two stage watermark updates or atomic sprite updates are falling into place. Still a lot of work left to do though. And the modesetting infrastructure has also seen a bit of work with the almost complete removal of the ->mode_set hooks. We need that for both atomic modeset updates and for proper runtime PM support.

On that topic: Runtime power management is now enabled for a bunch of our recent platforms - all the prep work from Paulo Zanoni and Imre Deak in the past few releases has finally paid off. There's still leftovers to be picked up over the coming releases like proper runtime PM support for DPMS on all platforms, addressing a bunch of crazy corner cases, rolling it out on the newer platforms like Cherryview or Broadwell and cleaning the code up a bit. But overall we're now ready for what the marketing people call "connected standby", which means that power consumption with all devices turned off through runtime pm should be as low as when doing a full system suspend. It crucially relies upon userspace not sucking and waking the cpu and devices up all the time, so personally I'm not sure how well this will work out really.

Another piece for proper atomic pageflip support is the universal primary plane support from Matt Roper. Based upon his DRM core work in 3.15 he now enabled the universal primary plane support in i915 properly. Unfortunately the corresponding patches for cursor support missed 3.16. The universal plane support is hence still disabled by default. For other atomic modeset work a shout-out goes to Rob Clark, whose locking conversion to wait/wound mutexes for modeset objects has been merged.

On the GEM side Chris Wilson massively improved our OOM handling. We are now much better at surviving a crash against the memory brickwall. And if we don't and indeed run out of memory we have much better data to diagnose the reason for the OOM. The top-down PDE allocator from Ben Widawsky better segregates our usage of the GTT and is one of the pieces required before we can enable full ppgtt for production use. And the command parser from Brad Volkin is required for some OpenGL and OpenCL features on Haswell. The parser itself is fully merged and ready, but the actual batch buffer copying to a secure location missed the merge window and hence it's not yet enabled in permission granting mode.

The big feature to pop the champagne for, though, is the userptr support from Chris - after years I've finally run out of things to complain about and merged it. This allows userspace to wrap up any memory allocations obtained by malloc() (or anything else backed by normal pages) into a GEM buffer object. Useful for faster uploads and downloads in lots of situations and currently used by the DDX to wrap X shmem segments. But OpenCL also wants to use this.

We've also enabled a few Broadwell features this time around: eDRAM support from Ben, VEBOX2 support from Zhao Yakui and gpu turbo support from Ben and Deepak S.

And finally there's the usual set of improvements and polish all over the place: GPU reset improvements on gen4 from Ville, prep work for DRRS (dynamic refresh rate switching) from Vandana, tons of interrupt and especially vblank handling rework (from Paulo and Ville) and lots of other things.

In Solaris 11.1, I updated the system headers to enable use of several attributes on functions, including noreturn and printf format, to give compilers and static analyzers more information about how they are used to give better warnings when building code.

In Solaris 11.2, I've gone back in and added one more attribute to a number of functions in the system headers: __attribute__((__deprecated__)). This is used to warn people building software that they’re using function calls we recommend no longer be used. While in many cases the Solaris Binary Compatibility Guarantee means we won't ever remove these functions from the system libraries, we still want to discourage their use.
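As a sketch of what such a header annotation looks like from the developer's side, the example below marks one function as deprecated and offers a bounded replacement. Both `demo_old_copy()` and `demo_bounded_copy()` are hypothetical names invented for this illustration and are not part of the Solaris headers; only the attribute syntax is the one named above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical declarations in the style of a system header: the
 * attribute goes on the prototype, so every caller that includes the
 * header gets the warning at compile time.
 */
char *demo_old_copy(char *dst, const char *src)
	__attribute__((__deprecated__));
size_t demo_bounded_copy(char *dst, size_t dstsize, const char *src);

/* Unbounded copy, kept only so that existing callers still link. */
char *demo_old_copy(char *dst, const char *src)
{
	return strcpy(dst, src);
}

/* Bounded replacement: always NUL-terminates, returns bytes copied. */
size_t demo_bounded_copy(char *dst, size_t dstsize, const char *src)
{
	size_t n;

	if (dstsize == 0)
		return 0;
	n = strlen(src);
	if (n >= dstsize)
		n = dstsize - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
	return n;
}
```

Defining the deprecated function does not trigger a warning; only call sites do, which is why the library itself can keep shipping the old symbol.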

I made passes through both the POSIX and C standards, and some of the Solaris architecture review cases to come up with an initial list which the Solaris architecture review committee accepted to start with. This set is by no means a complete list of Obsolete function interfaces, but should be a reasonable start at functions that are well documented as deprecated and seem useful to warn developers away from. More functions may be flagged in the future as they get deprecated, or if further passes are made through our existing deprecated functions to flag more of them.

| Header | Interface | Deprecated by | Alternative | Documented in |
|---|---|---|---|---|
| `<door.h>` | door_cred(3C) | PSARC/2002/188 | door_ucred(3C) | door_cred(3C) |
| `<kvm.h>` | kvm_read(3KVM), kvm_write(3KVM) | PSARC/1995/186 | Functions on kvm_kread(3KVM) man page | kvm_read(3KVM) |
| `<stdio.h>` | gets(3C) | ISO C99 TC3 (Removed in ISO C11), POSIX:2008/XPG7/Unix08 | fgets(3C) | gets(3C) man page, and just about every gets(3C) reference online from the past 25 years, since the Morris worm proved bad things happen when it’s used. |
| `<unistd.h>` | vfork(2) | PSARC/2004/760, POSIX:2001/XPG6/Unix03 (Removed in POSIX:2008/XPG7/Unix08) | posix_spawn(3C) | vfork(2) man page. |
| `<utmp.h>` | All functions from getutent(3C) man page | PSARC/1999/103 | utmpx functions from getutentx(3C) man page | getutent(3C) man page |
| `<varargs.h>` | varargs.h version of va_list typedef | ANSI/ISO C89 standard | `<stdarg.h>` | varargs(3EXT) |
| `<volmgt.h>` | All functions | PSARC/2005/672 | hal(5) API | volmgt_check(3VOLMGT), etc. |
| `<sys/nvpair.h>` | nvlist_add_boolean(3NVPAIR), nvlist_lookup_boolean(3NVPAIR) | PSARC/2003/587 | nvlist_add_boolean_value, nvlist_lookup_boolean_value | nvlist_add_boolean(3NVPAIR) & (9F), nvlist_lookup_boolean(3NVPAIR) & (9F). |
| `<sys/processor.h>` | gethomelgroup(3C) | PSARC/2003/034 | lgrp_home(3LGRP) | gethomelgroup(3C) |
| `<sys/stat_impl.h>` | _fxstat, _xstat, _lxstat, _xmknod | PSARC/2009/657 | stat(2) | old functions are undocumented remains of SVR3/COFF compatibility support |

If the above table is cut off when viewing in the blog, try viewing this standalone copy of the table.
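To illustrate one of the suggested migrations from the table, here is a minimal sketch of starting a child process with the posix_spawn family instead of a vfork(2)/exec pair. The helper name `run_and_wait()` is made up for this example, and it uses posix_spawnp() so that the command is looked up in PATH.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Run argv[0] (looked up in PATH) with the given arguments and wait
 * for it to finish. Returns the child's exit status, or -1 if it
 * could not be started or did not exit normally.
 */
int run_and_wait(char *const argv[])
{
	pid_t pid;
	int status;

	if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
		return -1;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing NULL for the file actions and attributes keeps the child's descriptors and signal setup inherited from the parent, which is the closest match to a plain vfork-then-exec sequence.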

# To See or Not To See

To see these warnings, you will need to be building with either gcc (versions 3.4, 4.5, 4.7, & 4.8 are available in the 11.2 package repo), or with Oracle Solaris Studio 12.4 or later (which like Solaris 11.2, is currently in beta testing). For instance, take this oversimplified (and obviously buggy) implementation of the cat command:

#include <stdio.h>

int main(int argc, char **argv) {
    char buf[80];

    while (gets(buf) != NULL)
        puts(buf);
    return 0;
}

Compiling it with the Studio 12.4 beta compiler will produce warnings such as:
% cc -V
cc: Sun C 5.13 SunOS_i386 Beta 2014/03/11
% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221


The exact warning given varies by compiler, and the compilers also have a variety of flags to either raise the warnings to errors, or silence them. Of course, the exact form of the output is Not An Interface that can be relied on for automated parsing, just shown for example.

gets(3C) is actually a special case — as noted above, it is no longer part of the C Standard Library in the C11 standard, so when compiling in C11 mode (i.e. when __STDC_VERSION__ >= 201112L), the <stdio.h> header will not provide a prototype for it, causing the compiler to complain it is unknown:

% gcc -std=c11 gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: implicit declaration of function ‘gets’ [-Wimplicit-function-declaration]
while (gets(buf) != NULL)
^

The gets(3C) function of course is still in libc, so if you ignore the error or provide your own prototype, you can still build code that calls it; you just have to acknowledge you’re taking on the risk of doing so yourself.

# Solaris Studio 12.4 Beta

% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221

% cc -errwarn=E_DEPRECATED_ATT gets_test.c
"gets_test.c", line 6:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221
cc: acomp failed for gets_test.c
This warning is silenced in the 12.4 beta by cc -erroff=E_DEPRECATED_ATT
No warning is currently issued by Studio 12.3 & earlier releases.

# gcc 3.4.3

% /usr/sfw/bin/gcc gets_test.c
gets_test.c: In function `main':
gets_test.c:6: warning: `gets' is deprecated (declared at /usr/include/iso/stdio_iso.h:221)

Warning is completely silenced with gcc -Wno-deprecated-declarations

# gcc 4.7.3

% /usr/gcc/4.7/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]

% /usr/gcc/4.7/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

# gcc 4.8.2

% /usr/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]
while (gets(buf) != NULL)
^

% /usr/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
while (gets(buf) != NULL)
^
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

## Global Graphics Translation Tables

Here are the basics of how the GEN GPU interacts with memory. The post focuses on the lowest levels of the i915 driver and on the hardware interaction. My hope is that by going through this in excruciating detail, I will be able to take more liberties in future posts.

## What is the Global Graphics Translation Table

The graphics translation tables provide the address mapping from the GPU’s virtual address space to a physical address[1]. The GTT is somewhat of a relic of the AGP days (GART), with the distinction being that the GTT, as it pertains to Intel GEN GPUs, has logic that is contained within the GPU and does not act as a platform IOMMU. I believe (and Wikipedia seems to agree) that GTT and GART were used interchangeably in the AGP days.

## GGTT architecture

Each element within the GTT is an entry, and the initialism for each entry is “PTE”, or page table entry. Much of the required initialization is handled by the boot firmware. The i915 driver gets any information it needs from that initialization via PCI config space or MMIO.

Example illustrating Intel/GEN memory organization:

### Location

The table is located within system memory, and is allocated for us by the BIOS or boot firmware. To clarify the docs a bit, GSM is the portion of stolen memory for the GTT, and DSM is the rest of stolen memory, used for miscellaneous things. DSM is the stolen memory referred to by the current i915 code as “stolen memory.” In theory we can get the location of the GTT from MMIO MPGFXTRK_CR_MBGSM_0_2_0_GTTMMADR (0x108100, 31:20), but we do not do that. The register space and the GTT entries are both accessible within BAR0 (GTTMMADR).

All the information can be found in Volume 12, p.129: UNCORE_CR_GTTMMADR_0_2_0_PCI. Quoting directly from the HSW spec, “The range requires 4 MB combined for MMIO and Global GTT aperture, with 2MB of that used by MMIO and 2MB used by GTT. GTTADR will begin at GTTMMADR + 2 MB while the MMIO base address will be the same as GTTMMADR.”

In the below code you can see we take the address in the PCI BAR and add half the length to the base. For all modern GENs, this is how things are split in the BAR.

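/* For Modern GENs the PTEs and register space are split in the BAR */
gtt_phys_addr = pci_resource_start(dev->pdev, 0) +
	(pci_resource_len(dev->pdev, 0) / 2);
dev_priv->gtt.gsm = ioremap_wc(gtt_phys_addr, gtt_size);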



One important thing to notice above is that the PTEs are mapped in a write-combined fashion. Write combining makes sequential updates (something which is very common when mapping objects) significantly faster. Also, the observant reader might ask, ‘why go through the BAR to update the PTEs if we have the actual physical memory location?’ This is the only way we have to make sure the GPU’s TLBs get synchronized properly on PTE updates. If this weren’t required, a nice optimization might be to update all the entries at once with the CPU, and then go tell the GPU to invalidate the TLBs.

### Size

Size is a bit more straightforward. We just read the relevant PCI offset. In the docs: p.151 GSA_CR_MGGC0_0_2_0_PCI offset 0x50, bits 9:8

And the code is even more straightforward.

static inline unsigned int gen6_get_total_gtt_size(u16 snb_gmch_ctl)
{
	snb_gmch_ctl >>= SNB_GMCH_GGMS_SHIFT;
	return snb_gmch_ctl << 20;
}

gtt_size = gen6_get_total_gtt_size(snb_gmch_ctl);
gtt_total = (gtt_size / sizeof(gen6_gtt_pte_t)) << PAGE_SHIFT;
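To put numbers on this, here is a small standalone sketch of the same arithmetic outside the kernel. The shift of 8 follows from the "bits 9:8" location quoted above; the two-bit mask, the 4-byte PTE size and the 4 KiB page size are assumptions of this sketch, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed layout: the GGMS field sits in bits 9:8 of the GMCH control word. */
#define GGMS_SHIFT     8
#define GGMS_MASK      0x3
#define PTE_SIZE       4u    /* assumed: one 32-bit PTE per page */
#define GPU_PAGE_SHIFT 12    /* assumed: 4 KiB pages */

/* Size of the PTE table itself in bytes (the field counts megabytes). */
uint32_t ggtt_table_bytes(uint16_t gmch_ctl)
{
    uint32_t ggms = (gmch_ctl >> GGMS_SHIFT) & GGMS_MASK;
    return ggms << 20;
}

/* Amount of GPU virtual address space that table can map, in bytes. */
uint64_t ggtt_address_space(uint16_t gmch_ctl)
{
    uint64_t entries = ggtt_table_bytes(gmch_ctl) / PTE_SIZE;
    return entries << GPU_PAGE_SHIFT;
}
```

With a field value of 2, the table is 2 MB, which holds 524288 entries and therefore maps 2 GB of GPU address space.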


### Layout

The PTE layout is defined by the PRM and as an example, can be found on page 35 of HSW – Volume 5: Memory Views. For convenience, I have reconstructed the important part here:

| 31:12 | 11 | 10:04 | 03:01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32 [2] | Cacheability Control[2:0] | Valid |

The valid bit is always set for all GGTT PTEs. The programming notes tell us to do this (also on page 35 of HSW – Volume 5: Memory Views)[3].

### Putting it together

As a result of what we’ve just learned, we can make up a function to write the PTEs:

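/**
 * gen_write_pte() - Write a PTE entry
 * @dev_priv:	The driver private structure
 * @address:	The physical address to back the graphics VA
 * @entry:	Which PTE in the table to update
 * @cache_type:	Preformatted cache type. Varies by platform
 */
static void
gen_write_pte(struct drm_i915_private *dev_priv, phys_addr_t address,
	      unsigned int entry, uint32_t cache_type)
{
	uint32_t pte;

	/* Table size in bytes, divided by the PTE size, is the max entry */
	BUG_ON(entry >= (gtt_size / 4));
	/* We can only use physical address bits 38:0 */
	BUG_ON((u64)address >> 39);

	/* Address bits 31:12 stay in place; bits 38:32 move to PTE bits 10:4 */
	pte = lower_32_bits(address) |
	      (upper_32_bits(address) << 4) |
	      cache_type |
	      1;
	iowrite32(pte, dev_priv->gtt.gsm + (entry * 4));
}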


### Example

Let’s analyze a real HSW running something. We can do this with the tool in the intel-gpu-tools suite, intel_gtt, passing it the -d option[4].

GTT offset |                 PTEs
--------------------------------------------------------
0x000000 | 0x0ee23025 0x0ee28025 0x0ee29025 0x0ee2a025
0x004000 | 0x0ee2b025 0x0ee2c025 0x0ee2d025 0x0ee2e025
0x008000 | 0x0ee2f025 0x0ee30025 0x0ee31025 0x0ee32025
0x00c000 | 0x0ee33025 0x0ee34025 0x0ee35025 0x0ee36025
0x010000 | 0x0ee37025 0x0ee13025 0x0ee1a025 0x0ee1b025
0x014000 | 0x0ee1c025 0x0ee1d025 0x0ee1e025 0x0ee1f025
0x018000 | 0x0ee80025 0x0ee81025 0x0ee82025 0x0ee83025
0x01c000 | 0x0ee84025 0x0ee85025 0x0ee86025 0x0ee87025


And just to continue beating the dead horse, let’s break out the first PTE:

| 31:12 | 11 | 10:04 | 03:01 | 0 |
|---|---|---|---|---|
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32 | Cacheability Control[2:0] | Valid |
| 0xee23000 | 0 | 0x2 | 0x2 | 1 |

Physical address: 0x20ee23000
Cache type: 0x2 (WB in LLC Only – Aged "3")
Valid: yes
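The same breakdown can be done mechanically. The following is a minimal user-space sketch that decodes a PTE according to the layout table above; the struct and function names are invented for illustration and do not come from the driver or from intel_gtt.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields of one HSW GGTT PTE. */
struct hsw_pte_fields {
    uint64_t phys_addr;   /* physical page address, bits 38:0 */
    uint8_t  cache;       /* Cacheability Control[3:0] */
    uint8_t  valid;       /* PTE bit 0 */
};

struct hsw_pte_fields hsw_pte_decode(uint32_t pte)
{
    struct hsw_pte_fields f;
    uint64_t addr_lo = pte & 0xfffff000u;   /* PTE 31:12 -> address 31:12 */
    uint64_t addr_hi = (pte >> 4) & 0x7f;   /* PTE 10:4  -> address 38:32 */

    f.phys_addr = (addr_hi << 32) | addr_lo;
    /* Control bit 3 lives in PTE bit 11, control bits 2:0 in PTE bits 3:1. */
    f.cache = (uint8_t)((((pte >> 11) & 0x1) << 3) | ((pte >> 1) & 0x7));
    f.valid = (uint8_t)(pte & 0x1);
    return f;
}

/* Each PTE maps one 4 KiB page, so a GTT offset selects entry offset / 4096. */
uint32_t gtt_offset_to_entry(uint32_t gtt_offset)
{
    return gtt_offset >> 12;
}
```

Feeding it the first PTE from the dump, 0x0ee23025, gives back the physical address 0x20ee23000, cache type 0x2 and a set valid bit, matching the manual breakdown.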


### Definition of a GEM BO

We refer to virtually contiguous locations which are mapped to specific graphics operands as objects, buffer objects, BOs, or GEM BOs.

In the i915 driver, the verb “bind” is used to describe the action of making a GPU virtual address range point to the valid backing pages of a buffer object[5]. The driver also reuses the verb “pin” from the Linux mm to mean: prevent the object from being unbound.

Example of  a “bound” GPU buffer

### Scratch Page

We’ve already talked about the scratch page twice, albeit briefly. There was an indirect mention, and of course in the image directly above. The scratch page is a single page allocated from memory which every unused GGTT PTE will point to.

To the best of my knowledge, the docs have never given a concrete explanation for the necessity of this; however, one might assume unintended behavior should the GPU take a page fault. One would be right to interject at this point with the fact that, by the very nature of DRI drivers, userspace can almost certainly find a way to hang the GPU. Why should we bother to protect them against this particular issue? Given that the GPU has undefined (read: not part of the behavioral specification) prefetching behavior, we cannot guarantee that even a well-behaved userspace won’t invoke page faults[6]. Correction: after writing this, I went and looked at the docs. They do explain exactly which engines can, and cannot, take faults. The “why” seems to be missing, however.
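To make the scratch page idea concrete, here is a toy model in plain C. It treats the GGTT as an ordinary array of 32-bit PTEs, whereas the real driver writes the entries through the BAR mapping; the function name and its bounds handling are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model: point every entry in [first, first + count) back at the
 * scratch page, clamped to the end of the table.
 * Returns the number of entries actually written.
 */
size_t ggtt_clear_range(uint32_t *table, size_t table_entries,
                        size_t first, size_t count, uint32_t scratch_pte)
{
    size_t i, written = 0;

    for (i = first; i < table_entries && written < count; i++, written++)
        table[i] = scratch_pte;
    return written;
}
```

Unbinding an object then amounts to calling this on the object's range, so a stray or prefetched access lands on the harmless scratch page instead of on whatever the stale PTE used to point to.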

## Mappings and the aperture

### The Aperture

First we need to take a bit of a diversion away from GEN graphics (which to repeat myself, are all of the shared memory type). If one thinks of traditional discrete graphics devices, there is always embedded GPU memory. This poses somewhat of an issue given that all end user applications require the CPU to run. The CPU still dispatches work to the GPU, and for cases like games, the event loop still runs on the CPU. As a result, the CPU needs to be able to both read, and write to memory that the GPU will operate on. There are two common solutions to this problem.
• DMA engine
  • Need to deal with asynchronous (and possibly out of order) completion. Latencies involved with both setup and completion notification.
  • Need to actually program the interface via MMIO, or send a command to the GPU[7]
  • Unlikely to re-arrange or process memory
    • tile/detile surfaces[8].
  • can’t take page faults, pages must be pinned
  • No size restrictions (I guess that’s implementation specific)
  • Completely asynchronous – the CPU is free to do whatever else needs doing.
• Aperture
  • Synchronous. Not only is it slow, but the CPU has to hand hold the data transfer.
  • Size limited/limited resource. There is really no excuse with PCIe and modern 64b platforms why the aperture can’t be as large as needed, but for Intel at least, someone must be making some excuses, because 512MB is as large as it gets for now.
  • Can swizzle as needed (for various tiling formats).
  • Simple usage model. Particularly for unified memory systems.

Moving data via the aperture

Moving data via DMA

The Intel GEN GPUs have no local memory[9]. However, DMA has very similar properties to writing the backing pages directly on unified memory systems. The aperture is still used for accesses to tiled memory, and for systems without LLC. LLC is out of scope for this post.

### GTT and MMAP

There are two distinct interfaces to map an object for reading or writing. There are lots of caveats to the usage of these two methods. My point isn’t to explain how to use them (libdrm is a better way to learn to use them anyway). Rather I wanted to clear up something which confused me early on.

The first is very straightforward, and has behavior I would have expected.

#define DRM_I915_GEM_MMAP       0x1e
struct drm_i915_gem_mmap {
	/** Handle for the object being mapped. */
	__u32 handle;
	/** Offset in the object to map. */
	__u64 offset;
	/**
	 * Length of data to map.
	 *
	 * The value will be page-aligned.
	 */
	__u64 size;
	/**
	 * Returned pointer the data was mapped at.
	 *
	 * This is a fixed-size type for 32/64 compatibility.
	 */
	__u64 addr_ptr;
};

// let bo_handle = some valid GEM BO handle to a 4k object
// What follows is a way to map the BO, and write something
struct drm_i915_gem_mmap arg;
memset(&arg, 0, sizeof(arg));
arg.handle = bo_handle;
arg.offset = 0;
arg.size = 4096;
// On success, addr_ptr holds the CPU address of the mapping
if (ioctl(fd, DRM_IOCTL_I915_GEM_MMAP, &arg) == 0)
	*(uint32_t *)(uintptr_t)arg.addr_ptr = 0xdeadbeef;


I might be projecting my ineptitude onto the reader, but it’s the second interface which caused me a lot of confusion, and one which I’ll talk briefly about. The interface itself is even smaller:

#define DRM_I915_GEM_MMAP_GTT   0x24
struct drm_i915_gem_mmap_gtt {
/** Handle for the object being mapped. */
__u32 handle;
/**
* Fake offset to use for subsequent mmap call
*
* This is a fixed-sizeso [sic] type for 32/64 compatibility.
*/
__u64 offset;
};


Why do I think this is confusing? The name itself never quite made sense – what use is there in mapping an object to the GTT? Furthermore, how does mapping it to the GPU allow me to do anything with it from userspace? For one thing, I had confused “mmap” with “map.” The former really does identify the recipient (the CPU, not the GPU) of the mapping. It follows the conventional use of mmap(). The other thing is that the interface has an implicit meaning. A GTT map here actually means a GTT mapping within the aperture space. Recall that the aperture is a subset of the GTT which can be accessed through a PCI BAR. Therefore, what this interface actually does is return a token to userspace which can be mmap’d to get the CPU mapping (through the BAR, to the GPU memory). Like I said before, there are a lot of caveats with the decision to use one vs. the other, which depend on the platform, the type of surface you are operating on, and the available aperture space at the time of the call. None of these things will be discussed here.

Conceptualized view of mmap and mmap_gtt

Finally, here is a snippet of code from intel-gpu-tools that hopefully just encapsulates what I said and drew.

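struct drm_i915_gem_mmap_gtt mmap_arg;
void *ptr;
mmap_arg.handle = handle;
assert(drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &mmap_arg) == 0);
ptr = mmap64(0, OBJECT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_arg.offset);
assert(ptr != MAP_FAILED); /* mmap reports failure as MAP_FAILED, not NULL */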


## Summary

This is how modern Intel GPUs deal with system memory on all platforms without a PPGTT (or if you disable it via module parameter). Although I happily skipped over the parts about tiling, fences, and cache coherency, rest assured that if you understood all of this post, you have a good footing. Going over the HSW docs again for this post, I am really pleased with how much Intel has improved the organization, and clarity. I highly encourage you to go off and read those for any missing pieces.

Please let me know about any bugs, or feature requests in this post. I would be happy to add them as time allows.

Here are links to SVGs of all the images I created. Feel free to use them how you please.

1. When using VT-d, the address is actually an I/O address rather than the physical address

2. Previous gens went to 39

3. I have submitted two patch series, one of which has been reverted, the other, never merged, which allow invalid PTEs for debug purposes

4. intel_gtt is currently not supported for GEN8+. If someone wants to volunteer to update this tool for gen8, please let me know

5. I’ve fought to call this operation, “map”

6. Empirically (for me), GEN7+ GPUs have behaved themselves quite well after taking the page fault. I very much believe we should be using this feature as much as possible to help userspace driver developers

7. I’ve previously written a post on how this works for Intel

8. Sorry people, this one is too far out of scope for an explanation in this post. Just trust it’s a limitation if you don’t understand. Daniel Vetter probably wrote an article about it if you feel like heading over to his blog

9. There are several distinct caches on all modern GEN GPUs, as well as eDRAM for Intel’s Iris Pro. The combined amount of this “local” memory is actually greater than many earlier discrete GPUs

 June 05, 2014

I don’t know if I’ve ever eaten my own dogfood that smells this risky.

A few days ago, I published patches to support dynamic page table allocation and tear-down in the i915 driver http://lists.freedesktop.org/archives/intel-gfx/2014-March/041814.html. This work will eventually help us support expanded page tables (similar to how things work for normal Linux page tables). The patches rely on using full PPGTT support, which still requires some work to get enabled by default. As a result, I’ll be carrying around this work for quite a while. The patches provide a lot of opportunity to uncover all sorts of weird bugs we’ve never seen due to the more stressful usage of the GPU’s TLBs. To avoid the patches getting too stale, and to further the bug extermination, I’ve figured, why not run it myself?

If you feel like some serious pain, or just want to help me debug it, give it a go – there should be absolutely no visible gain for you, only harm. You can either grab the patches from the mailing list, patchwork, or my branch.  Make sure to turn on full PPGTT support with i915.enable_ppgtt=2. If you do decide to opt for the pain, you can take comfort in the fact that you’re helping get the next big piece of prep work in place.

The question is, how long before I get sick of this terrible dogfood? I’m thinking by Monday I’ll be finished.

This is a short and vague glimpse of the interfaces that the Linux kernel offers to user space for display and graphics management, from the history to what is hot and new, to what might perhaps be coming after. The topic became current for me when I started preparing Weston for global thermonuclear war.

### The pre-history

In the age of dragons, kernel mode setting did not exist. There was only user space mode setting, where the job of the kernel driver (if any) was simply to give user space direct access to the graphics card registers. A user space driver (well, Xorg video DDX, really, err... or what it was at the time of XFree86) would then poke the card registers to set a mode. The kernel had no idea of anything.

The kernel DRM infrastructure was started as an out-of-tree kernel module for cooperating between multiple programs wanting to access the graphics card's resources. Later it was (partially?) merged into the kernel tree (the year is a lie, 2.3.18 came out in 1999), and much much later it was finally deleted from the libdrm repository.

### The middle age

For some time, the kernel DRM existed alongside user space mode setting. It was a dark time full of crazy hacks to keep it all together with duct tape, barbwire and luck. GPUs and hardware accelerated OpenGL started to come up.

### The new age

With the advent of kernel mode setting (KMS), the DRM kernel drivers took charge of the graphics card resources: outputs, video modes, memory allocations, hotplug! User space mode setting became obsolete and was eventually killed. The kernel driver was finally actually in control of the graphics hardware.

KMS probably started with just setting the main framebuffer (primary plane) for each "CRTC" and programming the video mode. A CRTC is for "cathode-ray tube controller", but essentially means a block that reads memory (a framebuffer) and produces a bitstream according to video mode timings. The bitstream is directed into an "encoder", which turns it into a proper physical/analogue signal, like VGA or digital DVI. The signal then exits the graphics card through a "connector". CRTC, encoder, and connector are the basic concepts in the KMS API. Quite often these can be combined in some restricted ways, like a single CRTC feeding two encoders for clone mode.

Even ancient hardware supported hardware cursors: a small sprite that was composited into the outgoing video signal on the fly, which meant that it was very cheap to move around. Cursor being so special, and often with funny color format (alpha!), got its very own DRM ioctl.

There were also hardware overlays (additional or secondary planes) on some hardware. While the primary framebuffer covers the whole display, an overlay is another buffer (just like the cursor) that gets mixed into the bitstream at the CRTC level. It is like basic compositing done on the scanout hardware level. Overlays usually had additional benefits, for example they could apply scaling or color space conversion (hello, video players) very efficiently. Overlays being different, they too got their very own DRM ioctls.

The KMS user space ABI was anything but atomic. With the X11 tradition, it wasn't too important how to update the displays, as long as the end result eventually was what you wanted. Race conditions in content updates didn't matter too much either, as X was racy as hell anyway. You update the CRTC. Then you update each overlay. You might update the cursor, too. By luck, all these updates could hit the same vblank. Or not. Or you don't hit vblank at all, and get tearing. No big deal, as X was essentially all about front-buffer rendering anyway. (And then there were huge efforts in trying to fix it all up with X, GLX, Mesa and GL-compositors, and avoid tearing, and it ended up complicated.)

With the advent of X compositing managers, which did not play well with the awkward X11 protocol (Xv) or the hardware overlays, and with the rise of GPU power and OpenGL, it was thought that hardware overlays would eventually die out. It turned out the benefits of hardware overlays were too great to abandon, and with Wayland we again have a decent chance to make the most of them while still enjoying compositing.

### The global thermonuclear war (named after a git branch by Rob Clark)

The quality of display updates became important. People do not like tearing. Someone actually wanted to update the primary framebuffer and the overlays on the same vblank, guaranteed. And the cursor as the cherry on top.

We needed one ABI to rule them all.

Universal planes brings framebuffers (primary planes), overlays (secondary planes) and cursors (cursor planes) together under the same API. No more type specific ioctls, but common ioctls shared by them all. As these objects are still somewhat different, overlays having wildly differing features and vendors wanting to expose their own stuff, object properties were invented.

An object property is essentially a {key, value} pair. In the API, the name of a key is a string. Each object has its own set of keys. To use a key, you must know it by name, fetch the handle, and then use the handle when setting the value. Handles seem to be per-object, so make sure to fetch them separately for each.
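The lookup-by-name, set-by-handle pattern can be sketched with a toy model. Everything below (the `toy_` types and functions) is invented for illustration and is not the libdrm API; in libdrm the corresponding calls are drmModeObjectGetProperties(), drmModeGetProperty() and drmModeObjectSetProperty(), which this sketch deliberately avoids so that it stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy property: user space knows the name, the API works with the id. */
struct toy_prop {
    const char *name;
    uint32_t    id;
    uint64_t    value;
};

/* Toy object (think: one plane or CRTC) with its own set of properties. */
struct toy_object {
    struct toy_prop *props;
    size_t           n_props;
};

/* Fetch the handle for a named property on this object; 0 means not found. */
uint32_t toy_find_prop_id(const struct toy_object *obj, const char *name)
{
    size_t i;

    for (i = 0; i < obj->n_props; i++)
        if (strcmp(obj->props[i].name, name) == 0)
            return obj->props[i].id;
    return 0;
}

/* Set a value through the handle. Returns 0 on success, -1 if unknown. */
int toy_set_prop(struct toy_object *obj, uint32_t id, uint64_t value)
{
    size_t i;

    for (i = 0; i < obj->n_props; i++) {
        if (obj->props[i].id == id) {
            obj->props[i].value = value;
            return 0;
        }
    }
    return -1;
}
```

Because the lookup takes the object as an argument, the handle for a given name is fetched separately for each object, as recommended above.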

Atomic mode setting and nuclear pageflip are two sides of the same feature. Atomicity is achieved by gathering a set of property changes, and then pushing them all into the kernel in a single ioctl call. Then that call either succeeds or fails as a whole. Libdrm offers a drmModePropertySet for gathering the changes. Everything is exposed as properties: the attached FB, overlay position, video mode, etc.

Atomic mode setting means setting the output modes of a single graphics device, more or less. Devices may have hard to express limitations. A simple example is the available scanout memory bandwidth: You can drive either two mid-resolution outputs, or one high-resolution output. Or maybe some crtc-encoder-connector combination is not possible with a particular other combination for another output. Collecting the video mode, encoder and connector setup over the whole graphics card into a single operation avoids flicker. Either the whole set succeeds, or it fails. Without atomic mode setting, changing multiple outputs would not only take longer, but if some step failed, you'd have to undo all earlier steps (and hope the undo steps don't fail). Plus, there would be no way to easily test if a certain combination is possible. Atomic mode setting fixes all this.
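The all-or-nothing behaviour can be illustrated with a toy model: check every requested change first, and touch the state only if all of them pass. The types, the per-property limit and the function name are hypothetical; this is not how the kernel implements it, only a sketch of the contract.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_N_PROPS 4

/* Toy device state: current values plus a made-up per-property limit. */
struct toy_state {
    uint64_t value[TOY_N_PROPS];
    uint64_t max[TOY_N_PROPS];
};

/* One requested change: which property, and its new value. */
struct toy_change {
    size_t   prop;
    uint64_t value;
};

/*
 * Validate every change first; apply them only if all pass.
 * Returns 0 on success, -1 if any change is rejected (state untouched).
 */
int toy_atomic_commit(struct toy_state *st,
                      const struct toy_change *changes, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (changes[i].prop >= TOY_N_PROPS ||
            changes[i].value > st->max[changes[i].prop])
            return -1;
    }
    for (i = 0; i < n; i++)
        st->value[changes[i].prop] = changes[i].value;
    return 0;
}
```

Since a rejected set leaves the state untouched, there is never a half-applied configuration to undo, and stopping after the validation loop would give a "test only" mode for probing whether a combination is possible.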

Nuclear pageflip is about synchronizing the update of a single output (monitor) and making that atomic. This means that when user space wants to update the primary framebuffer, move the cursor, and update a couple of overlays, all those changes happen at the same vblank. Again it all either succeeds or fails. "Every frame is perfect."

### And then there shall be ponies (at the end of the rainbow)

Once the global thermonuclear war is over, we have the perfect ABI for driving display updates.

Well, almost. Enter NVidia G-Sync, or AMD's FreeSync which is actually backed by a VESA standard. Dynamically variable refresh rate. We have no way yet for timing display updates in DRM. All we can do is kick out a display update, and it will hopefully land on the next vblank, whenever that is. But we can't tell the DRM when we would like it to be. Everything so far assumes that the display refresh rate is a constant, apart from an explicit mode switch. Though I have heard that e.g. Chrome for Intel (i915, LVDS/eDP reclocking) has some hacks that opportunistically drop the refresh rate to save power.

There is also a culprit in the DRM of today (Jun 3rd, 2014). You can schedule a pageflip, but if you have pending rendering on that framebuffer for the same GPU as where you are presenting it, the pageflip will not happen until the rendering completes. And you do not know when it will complete, which means you do not know if you will hit the very next vblank or something later.

If the rendering GPU is not the same graphics device that presents the framebuffer, you do not get synchronization at all. That means that you may be scanning out an incomplete rendering for a frame or two, or you have to stall the GPU to make sure it is done before scheduling the page flip. This should be fixed with the fences related to dma-bufs (Hi, Maarten Lankhorst).

And so the unicorn keeps on running.
 May 30, 2014

Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up. I was there mainly to discuss upcoming Ceilometer developments.

The summit has been great. It was my third OpenStack design summit, and the first one not being a PTL, meaning it was a largely more relaxed summit for me!

On Monday, we started with a 2.5-hour meeting with Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week.

Ceilometer had its design sessions running mainly on Wednesday. We took a lot of notes and commented during the sessions in our Etherpad instances. Here is a short summary of the sessions I attended.

# Scaling the central agent

I was in charge of the first session, and introduced the work that has been done so far on scaling the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the tooz library that we worked on at eNovance during the last 6 months.

Now that we have this foundation available, Cyril Roelandt has started to replace the Ceilometer alarming job partitioning code with Taskflow and Tooz. Starting with the alarming evaluators is simpler and will be a first proof of concept, to be reused by the central agent afterwards. We plan to get this merged for Juno.

For the central agent, the same work needs to be done, but since it's a bit more complicated, it will be done after the alarming evaluators are converted.

# Test strategy

The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot to be done in this area, and this is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we're still not there yet.

# Complex queries and per-user/project data collection

This session, led by Ildikó Váncsa, was about adding finer-grained configuration into the pipeline configuration to allow per-user and per-project data retrieval. This was not really controversial, though how to implement this exactly is still to be discussed, but the idea was well received. The other part of the session was about adding more in the complex queries feature provided by the v2 API.

# Rethinking Ceilometer as a Time-Series-as-a-Service

This was my main session, the reason we met on Monday for a few hours, and one of the most promising sessions – I hope – of the week.

It appears that the way Ceilometer designed its API and storage backends a long time ago is now an obstacle to scaling the data storage. Also, the events API we introduced in the last release partially overlaps with some of the functionality provided by the samples API, which is what causes us scaling troubles.

Therefore, I've started to rethink the Ceilometer API by building it as a time series read/write service, leaving the audit part of our previous sample API to the event subsystem. After some research and a few experiments, I've designed a new project called Gnocchi, which provides exactly that functionality in a hopefully scalable way.

Gnocchi is split in two parts: a time series API and its driver, and a resource indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time series handling is based on Pandas and Swift. The canonical resource indexer driver is based on SQLAlchemy.

The idea and project were well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.

# Revisiting the Ceilometer data model

This session, led by Alexei Kornienko, kind of echoed the previous session, as it clearly also tried to address the Ceilometer scalability issue, but in a different way.

Anyway, the SQL driver limitations have been discussed and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.

# Ceilometer devops session

We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting, and the list of things we could improve is long; I think it will help us drive our future efforts.

# SNMP inspectors

This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.

# Alarm and logs improvements

This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements on the alarm evaluation system provided by Ceilometer, and making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.

# Conclusion

Considering the current QA problems with Ceilometer, Eoghan Glynn, the new Project Technical Leader for Ceilometer, clearly indicated that this will be the main focus of the release cycle.

Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the next weeks. Our idea is to develop a complete solution at high velocity in the next weeks, and then work on its integration with Ceilometer itself.

 May 29, 2014

I spent last weekend in Beijing attending GNOME Asia 2014; yeah, long trip from Europe just for 3 days, but it was totally worth it. The worst part of it was of course fighting jet lag when I arrived, and fighting it again 3 days later when I came back to Spain :) The conference was really well organized [1], so kudos to all the local team!

After a quick sleep on Friday morning, I attended the development and documentation training sessions that Kat, André and Dave gave. They were quite interesting, especially since I’m not involved in the real user documentation that GNOME provides. I have to say that these guys do an amazing job, not only teaching during conferences, but also through the whole year.

There are, from my point of view, two main ways of learning new things:

• The ‘engineer’ way: Learning things as you need them, what you would do when you start writing an application and looking for examples of how to do what you want to do (autotools, anyone?). It is a very ‘engineer’ way, as you pick black boxes that you’ll use to build something bigger, while not fully understanding what the black box does inside.
• The ‘scientific’ way: When you learn something in order to fully understand it and be able to teach others. This approach takes a lot longer, as you need to make sure that everything you learn is accurate and you end up questioning the things that are not clear enough. Learning stuff to teach others is actually what you do in University; you’re learning things that will afterwards need to be explained in an exam to someone who knows more about the subject than you do.

Sure, both ways have their ups and downs, but if you want to write software you need to be able to switch between those two mindsets constantly. You’ll use the ‘engineer’ way when reading API docs, looking for the bits and pieces that you need to build your stuff. You’ll use the ‘scientific’ way when you need to start learning a new technology, or when you need more detail on how to do things. While the API docs are taken care of by the library developers, it is the documentation team that makes sure that user guides, tutorials, and other developer resources are kept up to date, which is definitely one of the toughest and most important tasks done to help newcomers and other developers. So go on, go learn GNOME technologies and teach others, join the documentation team! ;)

GNOME Asia is not a usual conference. If you have attended a Desktop Summit, GUADEC or FOSDEM before, all those conferences are built by developers and for developers. The focus of those conferences is usually not (explicitly) to attract newcomers, but instead to be a show of the latest and shiniest things happening in the projects. Of course we also had part of that in Beijing, say Lennart’s talk about the status of systemd or Allan’s talk about application bundles. Both were very good talks, but likely too specific for most of the audience. Instead, I chose to talk about something more basic, focused on attracting newcomers wanting to write applications, and so I gave an Introduction to D-Bus talk, including some examples. It is the same talk I gave last year at GUADEC-ES, but in English this time (my Mandarin is not good enough).

I would like to thank the GNOME Foundation for sponsoring the flight to Beijing, and of course the whole local team who, as I already said, did an amazing job.

[1] …except for the tea-less tea-breaks ;)


Recently Philip decided it was time to call for some attention. I happen to agree with him on the need to focus on developer experience; that's why I organized the first hackfest on this topic last year and attended this year. There are plenty of conversations around this and Philip, if you care so much, maybe you could attend or help. There's a lot to do and so few hands.

I've been asked to remove your blog by several people and I've reached the conclusion that it would be a really bad idea because it would set the wrong precedent and it would shift the discussion to the wrong topic (censorship yadda yadda). Questioning OPW should be allowed. The problem with your post is that if not questioned by other people (as many have done already) it would send the wrong message to the public and to prospective GSoC, OPW and general contributors. Your blog was the wrong place to question it, and your wording makes it clear that you have misunderstandings about how the community works.

You want to make things better? Why don't you start by learning how to work with others and contributing yourself? You think we need better leadership? Why don't you learn what it takes to become a leader? (hint: your blog post doesn't help)

Perhaps your lack of contact with the overall project and your absence from most events keep you from realizing how positive OPW has been. OPW has been a lot more successful than GSoC in retaining contributors and bringing diversity to our contributor base (and I don't mean gender diversity, but diversity in the nature of those contributions). I happen to have a pretty good picture of this because I get to manage the blogs of the people who stay and the people who leave. Without OPW GNOME would be worse community-wise and project-wise, and this is not an opinion, it is a _hard data_ backed fact (other posts have enumerated the contributions that would not have happened otherwise, so I will not do that here).

There are plenty of questions that I think are healthy to ask: for how long do we do OPW? Is its success only due to it being targeted at women, or is it successful for some other reason? You should have a conversation with Marina and other people involved with OPW and gather an understanding before making assumptions and throwing assertions around. And you should respect what people choose to do within the project, it's their goddamn time after all. In open source no one gets to dictate what anybody does (though alignment is always good if it can be achieved); people work on what they think is important and they try to do it together.

I think you should also watch this video; it might give you some understanding of why GNOME is as responsible for equality as any other entity.

 May 28, 2014

A common error when building from source is something like the error below:

    configure: error: Package requirements (foo) were not met:
    No package 'foo' found
    Consider adjusting the PKG_CONFIG_PATH environment variable if you
    installed software in a non-standard prefix.

Seeing that can be quite discouraging, but luckily, in many cases it's not too difficult to fix. As usual, there are many ways to get to a successful result; I'll describe what I consider the simplest.

# What does it mean?

pkg-config is a tool that provides compiler flags, library dependencies and a couple of other things to correctly link to external libraries. For more details on it see Dan Nicholson's guide. If a build system requires a package foo, pkg-config searches for a file foo.pc in the following directories: /usr/lib/pkgconfig, /usr/lib64/pkgconfig, /usr/share/pkgconfig, /usr/local/lib/pkgconfig, /usr/local/share/pkgconfig. The error message simply means pkg-config couldn't find the file and you need to install the matching package from your distribution or from source.
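To make that lookup concrete, here is a minimal sketch of the search in plain shell. It is not how pkg-config is implemented: `find_pc` is a made-up helper, and its default directory list is simply the one quoted above (your pkg-config build may well use a different one).

```shell
#!/bin/sh
# Hypothetical helper: print the first NAME.pc found in a colon-separated
# list of directories. The default list is an assumption taken from the
# directories named in the text, not queried from pkg-config itself.
find_pc() {
    name=$1
    dirs=${2:-/usr/lib/pkgconfig:/usr/lib64/pkgconfig:/usr/share/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/share/pkgconfig}
    old_ifs=$IFS
    IFS=:
    for dir in $dirs; do
        if [ -f "$dir/$name.pc" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name.pc"
            return 0
        fi
    done
    IFS=$old_ifs
    # Nothing found: this is the situation behind "No package 'foo' found".
    return 1
}
```

Running `find_pc foo` prints the path of the first match and exits non-zero if no directory contains foo.pc.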

# What package provides the foo.pc file?

In many cases the package is the development version of the package name. Try foo-devel (Fedora, RHEL, SuSE, ...) or foo-dev (Debian, Ubuntu, ...). yum provides a great shortcut to install any pkg-config dependency:

    $> yum install "pkgconfig(foo)"

will automatically search and install the right package, including its dependencies. apt-get requires a bit more effort:

    $> apt-get install apt-file
    $> apt-file update
    $> apt-file search --package-only foo.pc
    foo-dev
    $> apt-get install foo-dev

For those running Arch and pacman, the sequence is:

    $> pacman -S pkgfile
    $> pkgfile -u
    $> pkgfile foo.pc
    extra/foo
    $> pacman -S extra/foo

zypper is the same as yum:

    $> zypper in 'pkgconfig(foo)'

Once that's done you can re-run configure and see if all dependencies have been met. If more packages are missing, follow the same process for the next file.

Any users of other distributions - let me know how to do this on yours and I'll update the post.

# Where does the dependency come from?

In most projects using autotools the dependency is specified in the file configure.ac and looks roughly like one of these:

    PKG_CHECK_MODULES(FOO, [foo])
    PKG_CHECK_MODULES(DEPENDENCIES, foo [bar >= 1.4] banana)

The first argument is simply a name that is used in the build system, you can ignore it. After the comma is the list of space-separated dependencies. In this case this means we need foo.pc, bar.pc and banana.pc, and more specifically we need a bar.pc that is equal to or newer than version 1.4 of the package. To install all three follow the above steps and you're good.
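The "equal to or newer than" part is just a version comparison. As a rough illustration only (pkg-config has its own comparison code; this sketch leans on `sort -V` from GNU coreutils instead, and `version_ge` is a made-up name), it can be written in shell like this:

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is equal to or newer than $2.
# Assumes a sort(1) that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    have=$1
    want=$2
    # The older of the two versions sorts first; if that one is $want,
    # then $have is new enough.
    oldest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$oldest" = "$want" ]
}

if version_ge 1.8 1.4; then
    echo "bar 1.8 satisfies bar >= 1.4"
fi
```

A plain string comparison would not do here, since it would rank 1.10 below 1.9; the version sort treats each dotted component as a number.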

# My version is wrong!

It's not uncommon to see the following error after installing the right package:

    configure: error: Package requirements (foo >= 1.9) were not met:
    Requested 'foo >= 1.9' but version of foo is 1.8
    Consider adjusting the PKG_CONFIG_PATH environment variable if you
    installed software in a non-standard prefix.

Now you're stuck and you have a problem. What this means is that the package version your distribution provides is not new enough to build your software. This is where the simple solutions end and it all gets a bit more complicated - with more potential errors. Unless you are willing to go into the deep end, I recommend moving on and accepting that you can't have the newest bits on an older distribution. Because now you have to build the dependencies from source, and that may then require building their dependencies from source, and before you know it you've built 30 packages. If you're willing, read on; otherwise - sorry, you won't be able to run your software today.

# Manually installing dependencies

Now you're in the deep end, so be aware that you may see more complicated errors in the process. First of all you need to figure out where to get the source from. I'll now use cairo as an example instead of foo so you see actual data. On rpm-based distributions like Fedora run:

    $> yum info cairo-devel
    Loaded plugins: auto-update-debuginfo, langpacks
    Skipping unreadable repository '///etc/yum.repos.d/SpiderOak-stable.repo'
    Installed Packages
    Name        : cairo-devel
    Arch        : x86_64
    Version     : 1.13.1
    Release     : 0.1.git337ab1f.fc20
    Size        : 2.4 M
    Repo        : installed
    From repo   : fedora
    Summary     : Development files for cairo
    URL         : http://cairographics.org
    License     : LGPLv2 or MPLv1.1
    Description : Cairo is a 2D graphics library designed to provide high-quality
                : display and print output.
                :
                : This package contains libraries, header files and developer
                : documentation needed for developing software which uses the cairo
                : graphics library.

The important field here is the URL line - go to that and you'll find the source tarballs. That should be true for most projects but you may need to google for the package name and hope. Search for the tarball with the right version number and download it. On Debian and related distributions, cairo is provided by the libcairo2-dev package. Run apt-cache show on that package:

    $> apt-cache show libcairo2-dev
    Package: libcairo2-dev
    Source: cairo
    Version: 1.12.2-3
    Installed-Size: 2766
    Maintainer: Dave Beckett
    Architecture: amd64
    Provides: libcairo-dev
    Depends: libcairo2 (= 1.12.2-3), libcairo-gobject2 (= 1.12.2-3),[...]
    Suggests: libcairo2-doc
    Description-en: Development files for the Cairo 2D graphics library
     Cairo is a multi-platform library providing anti-aliased
     vector-based rendering for multiple target backends.
     .
     This package contains the development libraries, header files needed by
     programs that want to compile with Cairo.
    Homepage: http://cairographics.org/
    Description-md5: 07fe86d11452aa2efc887db335b46f58
    Tag: devel::library, role::devel-lib, uitoolkit::gtk
    Section: libdevel
    Priority: optional
    Filename: pool/main/c/cairo/libcairo2-dev_1.12.2-3_amd64.deb
    Size: 1160286
    MD5sum: e29852ae8e8e5510b00b13dbc201ce66
    SHA1: 2ed3534d02c01b8d10b13748c3a02820d10962cf
    SHA256: a6099cfbcc6bd891e347dd9abc57b7f137e0fd619deaff39606fd58f0cc60d27

In this case it's the Homepage line that matters, but the process of downloading tarballs is the same as above. For Arch users, the interesting line is URL as well:

    $> pacman -Si cairo | grep URL
    Repository     : extra
    Name           : cairo
    Version        : 1.12.16-1
    Description    : Cairo vector graphics library
    Architecture   : x86_64
    URL            : http://cairographics.org/
    Licenses       : LGPL MPL
    ....

zypper (Tizen, SailfishOS, Meego and others) doesn't have an interface for this, but you can run rpm on the package that you installed.

    $> rpm -qi cairo-devel
    Name        : cairo-devel
    [...]
    URL         : http://cairographics.org/

This command would obviously work on other rpm-based distributions too (Fedora, RHEL, ...). Unlike yum, it does require the package to be installed but by the time you get here you've already installed it anyway :)

Now to the complicated bit: In most cases, you shouldn't install the new version over the system version because you may break other things. You're better off installing the dependency into a custom folder ("prefix") and pointing pkg-config to it. So let's say you downloaded the cairo tarball, now you need to run:

    $> mkdir $HOME/dependencies/
    $> tar xf cairo-someversion.tar.xz
    $> cd cairo-someversion
    $> autoreconf -ivf
    $> ./configure --prefix=$HOME/dependencies
    $> make && make install
    $> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
    # now go back to original project and run configure again

So you create a directory called dependencies and install cairo there. This will install cairo.pc as $HOME/dependencies/lib/pkgconfig/cairo.pc. Now all you need to do is tell pkg-config that you want it to look there as well - so you set PKG_CONFIG_PATH. If you re-run configure in the original project, pkg-config will find the new version and configure should succeed. If you have multiple packages that all require a newer version, install them into the same path and you only need to set PKG_CONFIG_PATH once. Remember you need to set PKG_CONFIG_PATH in the same shell as you are running configure from.

If you keep seeing the version error the most common problem is that PKG_CONFIG_PATH isn't set in your shell, or doesn't point to the new cairo.pc file. A simple way to check is:
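One way to do such a check, sketched here under the assumption that cairo.pc carries the usual `Version:` line, is to walk over PKG_CONFIG_PATH by hand. `show_pc_versions` is a made-up helper for illustration, not part of pkg-config:

```shell
#!/bin/sh
# Hypothetical helper: for every directory in PKG_CONFIG_PATH, report whether
# NAME.pc is present and, if so, which version it declares.
show_pc_versions() {
    name=$1
    if [ -z "${PKG_CONFIG_PATH:-}" ]; then
        echo "PKG_CONFIG_PATH is not set in this shell"
        return 1
    fi
    old_ifs=$IFS
    IFS=:
    for dir in $PKG_CONFIG_PATH; do
        IFS=$old_ifs
        if [ -f "$dir/$name.pc" ]; then
            # .pc files declare their version on a "Version: x.y.z" line
            version=$(sed -n 's/^Version:[[:space:]]*//p' "$dir/$name.pc")
            printf '%s: %s\n' "$dir/$name.pc" "$version"
        else
            printf '%s: no %s.pc here\n' "$dir" "$name"
        fi
    done
    IFS=$old_ifs
}
```

Call it as `show_pc_versions cairo` in the same shell you run configure from. If pkg-config itself is at hand, `pkg-config --modversion cairo` prints the version it would actually pick up.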

# Retrospective

In retrospect, something that I probably didn't do in the best way possible was building a solid mailing list of interested people, and building up strong anticipation and an incentive to buy the book at launch date. My mailing list counted around 1500 people who had subscribed because they were interested in the launch of the book; in the end, probably only 10-15% of them bought the book during the launch, which is probably a bit lower than what I could have expected.

But more than a month later, I had distributed almost 500 copies of the book in total (including physical units) for more than $10000, so I tend to think that this was a success. I still sell a few copies of the book each week, but the numbers are small compared to the launch.

I sold fewer than 10 copies of the ebook using Bitcoins, and I admit I'm a bit disappointed and surprised about that.

Physical copies represent 10% of the book distribution. That's probably a lot lower than most of the people who pushed me to do it thought it would be, but it is still higher than what I thought it would be. So I would still advise having a paperback version of your book, if only because it's nice to have it in your library.

I only got positive feedback, a few typo notices, and absolutely no refund demands, which I really find amazing. The good news is also that I've been contacted by a couple of Korean and Chinese editors to get the book translated and published in those countries. If everything goes well, the book should be translated in the upcoming months and be available in these markets in 2015!

If you didn't get a copy, there's still time to do so!

 May 03, 2014

It took some time, but now it’s finally done: KDE has translations for AppStream upstream metadata!

AppStream is a Freedesktop project to extend metadata about the software projects which is available in distributions, especially regarding applications. Distributions compile a metadata file from data collected from packages, .desktop files and possibly other information sources, and create an AppStream XML file from it, which is then – directly or via a Xapian cache – read by software-center-like applications such as GNOME-Software or KDE’s Apper. Since the metadata available from current sources is not standardized and rather poor, upstream projects can ship small XML files, AppStream upstream metadata or AppData in short.

These files contain additional information about a project, such as a long description and links to screenshots. They also provide hints about public interfaces a software provides, for example binaries and libraries, making it possible for distributors to give users exactly the right package name in case they are missing a software component.

So, in order to represent graphical KDE applications like they deserve it in the new software centers making use of AppStream, we need to ship AppData files, with long descriptions, screenshots and a few URLs. But how can you create these metadata files?

In case you want your graphical KDE app to ship an AppData file, there is now a help page on the Techbase Wiki which provides all information needed to get started! For non-visual stuff or software which just wants to publish its provided interfaces with AppStream metadata, there is a dedicated page for that as well. Shipping metadata for non-GUI apps will help programmers to satisfy dependencies in order to compile new software, enhance bash-completion for missing binaries and provide some other neat stuff (take a look at this blogpost to get a taste of it).

And if you want to read a FAQ about the metadata stuff and get the bigger picture, just go to the Techbase Wiki page about AppStream metadata as well. The pages are not 100% final, so if you have questions, please write me a mail and I’ll update the pages, or simply correct/refine them yourself (it’s a wiki after all).

And now to the best thing: As soon as you ship an AppStream upstream metadata file (*.appdata.xml* for apps or *.metainfo.xml* for other stuff), the KDE l10n-script (Scripty!) will automatically start translating it, just like we already do with .desktop files. No further actions are necessary.

I already have a large amount of metadata files here, partially auto-generated, which show that we have about 160+ applications in KDE which could get an AppData file, not counting any frameworks or other non-GUI stuff yet. Since that is a bit much to submit via Reviewboard (which I originally planned to do), I hope I can commit the changes directly to the respective repositories, where the maintainers can take a look at it and adjust it to their liking. If that idea does not receive approval, I will just publish a set of data at some place for the KDE app maintainers to take as a reference (the auto-generated stuff needs some fixup to be commit-ready (which I’d do in case I can just commit changes)). Either way, it is safe now to write and ship AppData files in KDE projects!

In order to get your stuff translated, it is necessary that you follow the AppStream 0.6 metadata specification, and not one of the older revisions. You can easily detect 0.6 metadata by the <component> root node, instead of <application>, or by it having a metadata_license tag. We don’t support the older versions simply because it’s not necessary, as there were only two KDE projects shipping AppData before, which are now using 0.6 data as well. Since 0.6, the metadata XML format is guaranteed to be stable, and the only reason which could make me change it in an incompatible way is to prevent something as bad as the end of the world from happening (== won’t happen).

You can find the full specification (upstream and distro data) here. All parsers are able to handle 0.6 data now, and the existing tools are almost all migrated already (might take a few months to hit the distributions though).

So, happy metadata-writing! Thanks to all people who helped with making this happen, and especially Burkhard Lück and Albert Astals Cid for their patch-review and help with integrating the necessary changes into the KDE l10n-script.

 May 01, 2014

I’m at the GNOME Development Experience Hackfest in Berlin, and one of the things that I wanted to target during these days was to keep on looking at how we can enable different profiles in Devhelp. As you probably know, Devhelp will show you the documentation of libraries installed in your system (usually only if you have the -devel or -docs package of the library installed). While this is already enough for most users, there is also the case where a developer wants to target a different version (older or newer) of the library than the one installed in the system. A typical case for this is developing applications using GNOME’s jhbuild infrastructure, targeted either to a given GNOME release or to git master of the involved modules. In this case, if you want to use new methods of let’s say GTK+, you usually end up needing to fire up a web browser and looking for the latest GTK+ documentation either in developer.gnome.org or in your jhbuild’s ${prefix}/share/gtk-doc/html directory.

In order to avoid this, I’m prototyping some ideas to let the users switch between different profiles, e.g.:

• The ‘local’ profile, which is equivalent to what Devhelp currently shows.
• A user-defined ‘jhbuild’ profile, which could point to the install prefix of the jhbuild setup.
• Other user-defined profiles, which could point to other prefixes where the user has installed the newer (or older) libraries and their documentation.
• Profiles for each new GNOME release, e.g. 3.12, which could get downloaded from developer.gnome.org as a tarball containing all documentation for a given release.

The most challenging case is probably the last one, given that it would require some extra work in the website in order to make sure the documentation tarball is generated and published in every new release, plus of course client-side management of these downloaded profiles in Devhelp.

For now this is just a basic set of ideas, the final result may or may not be similar; we’re of course open to suggestions!

 April 30, 2014
10 years ago, I committed the first version of a browser plugin in Totem's source code tree. Today, it's going away.

The landscape of video on the Web changed, then changed back again, and web technologies have moved on. We've witnessed:

• The fall of RealPlayer
• The rise of Flash video players, as a way to turn videos into black boxes with minimal "copy protection" (cf. "YouTube downloader" in your favourite search engine)
• The rise and precipitous fall of Silverlight (with only a handful of websites, ever, or still, using it)
• And most importantly, the advent of HTML5's <video> tag
Totem's browser plugin did as good a job as it could mimicking legacy web browser plugins from other platforms, such as QuickTime or Windows Media Player (even we stopped caring about mimicking RealPlayer).

It wasn't helped by the ill-defined Netscape Plugin API (NPAPI), which meant that we never knew whether we'd receive a stream for the video we were about to play or none at all, whether requesting a stream would give us both an automatic one and the one we requested, or whether it would download empty files. We also couldn't tell the browser to open a file in another application when the user clicked directly on it. All in all, pretty dire.

We made attempts at replacing the Flash plugin for playing back videos, but the NPAPI meant that we needed to handle everything or nothing. Ideally, we'd have been able to tell the browser to use our browser plugin for websites that we could support through libquvi, and either fall back to a placeholder or the real Flash plugin for other cases. NPAPI didn't allow us to do that.

The current state of media playback in browsers on Linux is such that:
Given all this, and the facts that Totem's browser plugin will not work on Wayland (it uses XEmbed to slot into the browser UI), that its UI is pretty broken since the redesign of the main player (not unfixable, but time consuming), and that it does not work properly in GNOME's own web browser (due to bad interactions between Clutter and GL acceleration in WebKit), I think it's time to call it a day.

Good bye Totem browser plugin.

I'll miss the clever puns of your compatibility plugins (Real Player/Complex and QuickTime/NarrowSpace being the best ones). I won't miss interacting with ill-defined APIs and buggy implementations.

I've just updated the post about X.Org synaptics support for the Lenovo T440, T540, X240, Helix, Yoga, X1 Carbon. For those following my blog, here is a rough diff of the updates:

• All touchpads in this series need a kernel quirk to fix the min/max ranges. It's like a happy meal toy, make sure you collect them all.
• A new kernel evdev input prop INPUT_PROP_TOPBUTTONPAD is available in 3.15. It marks the devices that require top software buttons. It will be backported to stable.
• A new option, HasSecondarySoftButtons, was added to the synaptics driver. It is automatically set if INPUT_PROP_TOPBUTTONPAD is set; when it is set, the driver parses the SecondarySoftButtonAreas option and honours the values in it.
• If you have the kernel min/max fixes and the new property, don't bother with DMI matching. Provide a xorg.conf.d snippet that unconditionally merges the SecondarySoftButtonAreas and rely on the driver for parsing it when appropriate
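
As a rough sketch of what such an unconditional snippet could look like, the helper below writes one into a directory of your choice. The file name, the Identifier and the two Match lines are my own assumptions for illustration; the percentages are the ones from the T540 example further down and may need adjusting for other models.

```shell
#!/bin/sh
# Hypothetical helper: write an InputClass snippet into the directory given
# as $1. File name, Identifier and Match* lines are illustrative assumptions;
# the button area values are copied from the T540 example in this post.
write_softbutton_snippet() {
    confdir=$1
    cat > "$confdir/99-top-softbuttons.conf" <<'EOF'
Section "InputClass"
        Identifier "Top software buttons (example)"
        MatchIsTouchpad "on"
        MatchDriver "synaptics"
        Option "SecondarySoftButtonAreas" "58% 0 0 8% 42% 58% 0 8%"
EndSection
EOF
}
```

Something like `write_softbutton_snippet /etc/X11/xorg.conf.d` (as root) would put it in the usual place. There is no DMI or tag matching here on purpose: the driver only looks at the option when HasSecondarySoftButtons ends up set.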

This is a follow-up to my post from December Lenovo T440 touchpad button configuration. Except this time the support is real, or at least close to being finished. Since I am now seeing more and more hacks to get around all this I figured it's time for some info from the horse's mouth.

[update] I forgot to mention: synaptics 1.8 will have all these, the first snapshot is available here

Lenovo's newest series of laptops have a rather unusual touchpad. The trackstick does not have a set of physical buttons anymore. Instead, the top part of the touchpad serves as software-emulated buttons. In addition, the usual ClickPad-style software buttons are to be emulated on the bottom edge of the touchpad. An ASCII-art of that would look like this:

    +----------------------------+
    | LLLLLLLLLL MMMMM RRRRRRRRR |
    |                            |
    |                            |
    |                            |
    |                            |
    |                            |
    |                            |
    | LLLLLLLL          RRRRRRRR |
    +----------------------------+

Getting this to work required a fair bit of effort, patches to synaptics, the X server and the kernel and a fair bit of trial-and-error. Kudos for getting all this sorted goes to Hans de Goede, Benjamin Tissoires, Chandler Paul and Matthew Garrett. And in the process of fixing this we also fixed a bunch of other issues that have been plaguing clickpads for a while.

The first piece in the puzzle was to add a second software button area to the synaptics driver. Option "SecondarySoftButtonAreas" now allows a configuration in the same manner as the existing one (i.e. right and middle button). Any click in that software button area won't move the cursor, so the buttons will behave just like physical buttons. Option "HasSecondarySoftButtons" defines if that button area is to be used. Of course, we expect that button area to work out of the box, so we now ship configuration files that detect the touchpad and apply that automatically. Update 30 Apr: Originally we tried to get this done based on the PNPID or DMI matching but a better solution is the new INPUT_PROP_TOPBUTTONPAD evdev property bit. This is now applied to all these touchpads, and the synaptics driver uses this to enable the secondary software button area. This bit will be available in kernel 3.15, with stable backports happening after that.

The second piece in the puzzle was to work around the touchpad firmware. The touchpads speak two protocols, RMI4 over SMBus and PS/2. Windows uses RMI4, Linux still uses PS/2. Apparently the firmware never got tested for PS/2 so the touchpad gives us bogus data for its axis ranges. A kernel fix for this is in the pipe. Update 30 Apr: every single touchpad of this generation needs a fix. They have been or are being merged.

Finally, the touchpad needed to be actually usable. So a bunch of patches that tweak the clickpad behaviours were merged in. If a finger is set down inside a software button area, finger movement no longer affects the cursor. This stops the ever-so-slight but annoying movements when you execute a physical click on the touchpad. Also, there is a short timeout after a click to avoid cursor movement when the user just presses and releases the button. The timeout is short enough that if you do a click-and-hold for drag-and-drop, the cursor will move as expected. If a touch started outside a software button area, we can now use the whole touchpad for movement. And finally, a few fixes to avoid erroneous click events - we'd sometimes get the software button wrong if the event sequence is off.

Another change changed the behaviour of the touchpad when it is disabled through the "Synaptics Off" property. If you use syndaemon to disable the touchpad while typing, the buttons now work even when the touchpad is disabled. If you don't like touchpads at all and prefer to use the trackstick only, use Option "TouchpadOff" "1". This will disable everything but physical clicks on the touchpad.

On that note I'd also like to mention another touchpad bug that was fixed in the recent weeks: plenty of users reported synaptics having a finger stuck after suspend/resume or sometimes even after logging in. This was an elusive bug and finally tracked down to a mishandling of SYN_DROPPED events in synaptics 1.7 and libevdev. I won't provide a fix for synaptics 1.7 but we've fixed libevdev - please use synaptics 1.8 RC1 or later and libevdev 1.1 RC1 or later.

Update 30 Apr: If the INPUT_PROP_TOPBUTTONPAD is not available on your kernel, you can use DMI matching through udev rules. PNPID matching requires a new kernel patch as well, at which point you might as well rely on the INPUT_PROP_TOPBUTTONPAD property. An example for udev rules that we used in Fedora is below:

    ATTR{[dmi/id]product_version}=="*T540*", ENV{ID_INPUT.tags}="top_softwarebutton_area"

and with the matching xorg.conf snippet:

    Section "InputClass"
            Identifier "Lenovo T540 trackstick software button buttons"
            MatchTag "top_softwarebutton_area"
            Option "HasSecondarySoftButtons" "on"
            # If you dont have the kernel patches for your touchpad
            # to fix the min/max ranges, you need to use absolute coordinates
            # Option "SecondarySoftButtonAreas" "3363 0 0 2280 2717 3362 0 2280"
            Option "SecondarySoftButtonAreas" "58% 0 0 8% 42% 58% 0 8%"
    EndSection

Update 30 Apr: For those touchpads that already have the kernel fix to adjust the min/max range, simply specifying the buttons in % of the touchpad dimensions is sufficient. For all other touchpads, you'll need to use absolute coordinates.

Fedora users: everything is being built in rawhide, F20 and F19 (update 30 Apr). The COPR listed in an earlier version of this post is not available anymore.

 April 29, 2014

When Solaris 11.1 came out in October 2012, I posted about the changes to the included FOSS packages. With the publication today of Solaris 11.2 beta, I thought it would be nice to revisit this and see what’s changed in the past year and a half. This time around, I’m including some bundled packages that aren’t necessarily covered by a free software or open source license, but are of interest to Solaris users.

Last time I discussed how IPS allowed us to make a variety of changes in update releases much more easily than in the Solaris 10 package system. One of these changes is obsoleting packages, and we’ve done that in a couple rare cases in both Solaris 11.1 and 11.2 where the software is abandoned by the upstream, and we’ve decided it would be worse to keep it around, potentially broken, than to remove it on upgrade.

When we do this, notices will be posted to the End of Features for Solaris 11 web page, alongside the list of features that have been declared deprecated and may be removed in future releases. As you can see there, in Solaris 11.1 the Adobe Flash Player and tavor HCA driver packages were removed.

In Solaris 11.2, three more packages have been removed. slocate was a “secure” version of the locate utility, which wouldn’t show a user any files that they didn’t have permission to access. Unfortunately, this utility was broken by changes in the AST library, and since there is no longer an upstream for it, we decided to follow the lead of several Linux distros and moved to mlocate instead, which is added in this release.

The other two removed packages are both Xorg video drivers - the nv driver for NVIDIA graphics, and the trident driver for old Trident graphics chipsets. Most users will not notice these removals, but if you had manually created an xorg.conf file specifying one of these drivers, you may need to edit it to use the vesa driver instead.

NVIDIA had previously supported the nv open source driver and contributed updates to X.Org to support new chipsets in it, but in 2010, they announced they would no longer do so, and considered nv deprecated, recommending the use of the VESA driver for those who had no better driver to use. While we had continued to ship the nv driver in Solaris, it led to an increasing number of crashes, hangs, and other bugs for which the resolution was to remove the nv driver and use vesa instead, so we are removing it to end those issues. For systems with graphics devices new enough to be supported by the bundled nvidia closed-source driver, this will have no effect. For those with older devices, this will cause Xorg autoconfiguration to load the vesa driver instead, until and unless the user downloads & installs an appropriate NVIDIA legacy driver.

The trident driver was still in Solaris even after we dropped 32-bit support on x86, and years after Trident Microsystems exited the graphics business and sold its graphics product line to XGI, as the Sun Fire V20z server included a Trident chipset for the console video device. Unfortunately, the upstream driver has been basically unmaintained since then, and Oracle has had to apply patches to port it to new Xorg releases. Meanwhile, in order to resolve bugs that caused system hangs, the trident driver was modified to not load on V20z systems, which left us shipping an unmaintained driver solely for a system that could not use it and used the vesa driver instead, so we decided to remove it as well.

If you had either of these Xorg driver packages installed, then when you update to 11.2, pkg update will inform you that there are release notes for these drivers, to warn you that you may need to edit your xorg.conf.

## System Management Stack

The popular Puppet system for automating configuration changes across machines has been included in Solaris, and updated to support several Solaris features in both the framework and in individual configuration providers. For instance, configuration changes made via Puppet will be recorded in the Solaris audit logs as part of a puppet session, and Puppet’s configuration file is generated from SMF properties using the new SMF stencil facilities. Providers are included that can configure IPS publishers, SMF properties, ZFS datasets, Solaris boot environments, and a variety of Solaris NIC, VNIC, and VLAN settings.

Another addition is the Oracle Hardware Management Pack (HMP), a set of tools that work with the ILOM, firmware, and other components in Sun/Oracle servers to configure low-level system options. Previously these needed to be downloaded and installed separately; now they are a simple pkg install away, and kept up to date with the rest of the OS.

A collaboration with Intel led to the integration of a Solaris port of Intel’s numatop tool for observing memory access locality across CPUs.

From the open source world, we’ve integrated several tools to allow admins and users to do multiple things at once, including the tmux terminal multiplexer, cssh tool for cluster administration via ssh, and GNU Parallel for running commands in parallel.

## Developer Stack

For developers, GNU Compiler Collection (gcc) versions 4.7 & 4.8 are added alongside the previous 3.4 & 4.5 packages, and the gcc packages have been refactored to better allow installing different subsets of compilers. Other updated developer tools include Mercurial 2.8.2, GNU emacs 24.3, pylint 0.25.2, and version 7.6 of the GNU debugger, gdb. Newly added tools for developers include GNU indent, JavaScript Lint, and Python’s pep8.

The Java 8 development kit & runtime environment are both available as well. The default installation clusters will only install Java 7, but you can install the Java 8 runtime with “pkg install jre-8” or get both the runtime & development kits with “pkg install jdk-8”. The /usr/java mediated link, through which all the links in /usr/bin for the java, jar, javac, etc. commands flow, will be set by default to the most recent version installed, so installing Java 8 will make that version the default. You can see this via “ls -l /usr/java” reporting:

lrwxrwxrwx   1 root   root     15 Apr 23 14:01 /usr/java -> jdk/jdk1.8.0_05

or via “pkg mediator java” reporting:

MEDIATOR     VER. SRC. VERSION IMPL. SRC. IMPLEMENTATION
java         system    1.8     system

If you want to choose a different version to be default, you can manually set the mediator to that version with “pkg set-mediator -V 1.7 java”. Of course, for many operations, you can directly access any installed java version via the full path, such as /usr/jdk/instances/jdk1.8.0/bin/java instead of relying on the /usr/bin symlinks.

One caveat to be aware of is that Java 8 for Solaris is only provided as 64-bit binaries, as all Solaris 11 and later machines are running 64-bit now. This means that any JNI modules you rely on will need to be compiled as 64-bit and any programs that try to load Java must be 64-bit. There is also no 64-bit version provided of either the Java plugin for web browsers, or the Java Webstart program for starting Java client applications from web pages.

## Desktop Stack

Most of the changes in the desktop stack in this release were updates needed to fix security issues, and are mostly covered on the Oracle Third Party Vulnerability Resolution Blog.

There were some feature updates in the X Window System layers of the desktop stack though – most notably the Xorg server was upgraded from 1.12 to 1.14, and the accompanying Mesa library was upgraded to version 9.0.3, which includes support for OpenGL 3.1 and GLSL 1.40 on Intel graphics. The bundled version of NVIDIA’s graphics driver was also updated, to NVIDIA’s latest “long lived branch” - 331. For users with older graphics cards which are no longer supported in this branch, legacy branches are available from NVIDIA’s Unix driver download site.

## OpenStack

And last, but certainly not least, especially in the number of packages added to the repository, is the addition of OpenStack support in Solaris. The Cinder Block Storage Service, Glance Image Service, Horizon Dashboard, Keystone Identity Service, Neutron Networking Service, and Nova Compute Service from the OpenStack Grizzly (2013.1) release are all provided, in versions tested and integrated with Solaris features. Between the OpenStack packages themselves and all the Python modules required for them, there are over 100 new FOSS packages in this release.

## Detailed list of changes

This table shows most of the changes to the bundled packages between the original Solaris 11.1 release, the latest Solaris 11.1 support repository update (SRU18, released April 14, 2014), and the Solaris 11.2 beta released today.

As with last time, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

| Package | Upstream | 11.1 | 11.1 SRU18 | 11.2 Beta |
|---|---|---|---|---|
| archiver/gnu-tar | GNU tar | 1.26 | 1.26 | 1.27.1 |
| archiver/unrar | UnRAR | 4.1.4 | 4.1.4 | 4.2.4 |
| cloud/openstack/cinder | OpenStack | not included | not included | 0.2013.1.4 |
| cloud/openstack/glance | OpenStack | not included | not included | 0.2013.1.4 |
| cloud/openstack/horizon | OpenStack | not included | not included | 0.2013.1.4 |
| cloud/openstack/keystone | OpenStack | not included | not included | 0.2013.1.4 |
| cloud/openstack/neutron | OpenStack | not included | not included | 0.2013.1.4 |
| cloud/openstack/nova | OpenStack | not included | not included | 0.2013.1.4 |
| communication/im/pidgin | pidgin | 2.10.5 | 2.10.5 | 2.10.9 |
| compress/gzip | GNU gzip | 1.4 | 1.5 | 1.5 |
| compress/pbzip2 | Parallel bzip2 | not included | not included | 1.1.6 |
| compress/pixz | pixz | not included | not included | 1.0 |
| crypto/gnupg | GnuPG | 2.0.17 | 2.0.17 | 2.0.22 |
| database/berkeleydb-5 | Oracle Berkeley DB | 5.1.25 | 5.1.25 | 5.3.21 |
| database/mysql-55 | MySQL | not included | not included | 5.5.31 |
| database/sqlite-3 | SQLite | 3.7.11 | 3.7.14.1 | 3.7.14.1 |
| desktop/window-manager/twm | X.Org | 1.0.7 | 1.0.7 | 1.0.8 |
| developer/build/ant | Apache Ant | 1.7.1 | 1.8.4 | 1.8.4 |
| developer/build/autoconf/xorg-macros | X.Org | 1.17 | 1.17 | 1.17.1 |
| developer/build/imake | X.Org | 1.0.5 | 1.0.5 | 1.0.6 |
| developer/build/makedepend | X.Org | 1.0.4 | 1.0.4 | 1.0.5 |
| developer/debug/gdb | GNU GDB | 6.8 | 6.8 | 7.6 |
| developer/gcc-47 | GNU Compiler Collection | not included | not included | 4.7.3 |
| developer/gcc-48 | GNU Compiler Collection | not included | not included | 4.8.2 |
| developer/gnu-indent | GNU indent | not included | not included | 2.2.9 |
| developer/java/jdk-6 | Java | 1.6.0.35 | 1.6.0.75 | 1.6.0.75 |
| developer/java/jdk-7 | Java | 1.7.0.7 | 1.7.0.55.13 | 1.7.0.55.13 |
| developer/java/jdk-8 | Java | not included | not included | 1.8.0.5.13 |
| developer/java/junit | JUnit | 4.10 | 4.10 | 4.11 |
| developer/javascript/jsl | JavaScript Lint | not included | not included | 0.3.0 |
| developer/python/pylint | pylint | 0.18.0 | 0.18.0 | 0.25.2 |
| developer/versioning/mercurial | Mercurial SCM | 2.2.1 | 2.2.1 | 2.8.2 |
| diagnostic/nmap | nmap | 5.51 | 6.25 | 6.25 |
| diagnostic/numatop | numatop | not included | not included | 1.0 |
| diagnostic/scanpci | X.Org | 0.13.1 | 0.13.1 | 0.13.2 |
| diagnostic/tcpdump | tcpdump | 4.1.1 | 4.5.1 | 4.5.1 |
| diagnostic/wireshark | Wireshark | 1.8.2 | 1.8.12 | 1.10.6 |
| document/viewer/xditview | X.Org | 1.0.2 | 1.0.2 | 1.0.3 |
| driver/graphics/nvidia | NVIDIA | 0.295.20.0 | 0.295.20.0 | 0.331.38.0 |
| editor/gnu-emacs | GNU Emacs | 23.4 | 23.4 | 24.3 |
| editor/xedit | X.Org | 1.2.0 | 1.2.0 | 1.2.1 |
| file/gnu-coreutils | GNU Coreutils | 8.5 | 8.5 | 8.16 |
| file/mc | GNU Midnight Commander | 4.7.5.2 | 4.7.5.2 | 4.8.8 |
| file/mlocate | mlocate | not included | not included | 0.25 |
| file/slocate | | 3.1 | 3.1 | not included |
| image/editor/bitmap | X.Org | 1.0.6 | 1.0.6 | 1.0.7 |
| image/imagemagick | ImageMagick | 6.3.4.2 | 6.8.3.5 | 6.8.3.5 |
| library/cacao | Common Agent Container | 2.3.1.0 | 2.4.2.0 | 2.4.2.0 |
| library/graphics/pixman | X.Org | 0.24.4 | 0.24.4 | 0.29.2 |
| library/libarchive | libarchive | not included | not included | 3.0.4 |
| library/libmilter | Sendmail | 8.14.5 | 8.14.7 | 8.14.7 |
| library/libxml2 | XML C parser | 2.7.6 | 2.7.6 | 2.9.1 |
| library/libxslt | libxslt | 1.1.26 | 1.1.26 | 1.1.28 |
| library/neon | neon | 0.29.5 | 0.29.5 | 0.29.6 |
| library/perl-5/perl-x11-protocol | CPAN: X11-Protocol | not included | not included | 0.56 |
| library/perl-5/xml-libxml | CPAN: XML::LibXML | not included | 2.14 | 2.14 |
| library/perl-5/xml-namespacesupport | CPAN: XML::NamespaceSupport | not included | 1.11 | 1.11 |
| library/perl-5/xml-sax | CPAN: XML::SAX | not included | 0.99 | 0.99 |
| library/perl-5/xml-sax-base | CPAN: XML::SAX::Base | not included | 1.08 | 1.08 |
| library/perl5/perl-tk | CPAN: Tk | not included | not included | 804.31 |
| library/python-2/alembic | alembic | not included | not included | 0.6.0 |
| library/python-2/amqp | amqp | not included | not included | 1.0.12 |
| library/python-2/anyjson | anyjson | not included | not included | 0.3.3 |
| library/python-2/argparse | argparse | not included | 1.2.1 | 1.2.1 |
| library/python-2/babel | babel | not included | not included | 1.3 |
| library/python-2/beautifulsoup4 | beautifulsoup4 | not included | not included | 4.2.1 |
| library/python-2/boto | boto | not included | not included | 2.9.9 |
| library/python-2/cheetah | cheetah | not included | not included | 2.4.4 |
| library/python-2/cliff | cliff | not included | not included | 1.4.5 |
| library/python-2/cmd2 | cmd2 | not included | not included | 0.6.7 |
| library/python-2/cov-core | cov-core | not included | not included | 1.7 |
| library/python-2/cssutils | cssutils | not included | not included | 0.9.6 |
| library/python-2/d2to1 | d2to1 | not included | not included | 0.2.10 |
| library/python-2/decorator | decorator | not included | not included | 3.4.0 |
| library/python-2/django | django | not included | not included | 1.4.10 |
| library/python-2/django-appconf | django-appconf | not included | not included | 0.6 |
| library/python-2/django_compressor | django_compressor | not included | not included | 1.3 |
| library/python-2/django_openstack_auth | OpenStack | not included | not included | 1.1.3 |
| library/python-2/eventlet | eventlet | not included | not included | 0.13.0 |
| library/python-2/filechunkio | filechunkio | not included | not included | 1.5 |
| library/python-2/formencode | formencode | not included | not included | 1.2.6 |
| library/python-2/greenlet | greenlet | not included | not included | 0.4.1 |
| library/python-2/httplib2 | httplib2 | not included | not included | 0.8 |
| library/python-2/importlib | importlib | not included | not included | 1.0.2 |
| library/python-2/ipython | ipython | not included | not included | 0.10 |
| library/python-2/iso8601 | iso8601 | not included | not included | 0.1.4 |
| library/python-2/jsonpatch | jsonpatch | not included | not included | 1.1 |
| library/python-2/jsonpointer | jsonpointer | not included | not included | 1.0 |
| library/python-2/jsonschema | jsonschema | not included | not included | 2.0.0 |
| library/python-2/kombu | kombu | not included | not included | 2.5.12 |
| library/python-2/lesscpy | lesscpy | not included | not included | 0.9.10 |
| library/python-2/librabbitmq | librabbitmq | not included | not included | 1.0.1 |
| library/python-2/libxml2-26 | libxml2 | 2.7.6 | 2.7.6 | 2.9.1 |
| library/python-2/libxml2-27 | libxml2 | 2.7.6 | 2.7.6 | 2.9.1 |
| library/python-2/libxsl-26 | libxsl | 1.1.26 | 1.1.26 | 1.1.28 |
| library/python-2/libxsl-27 | libxsl | 1.1.26 | 1.1.26 | 1.1.28 |
| library/python-2/lockfile | lockfile | not included | not included | 0.9.1 |
| library/python-2/logilab-astng | logilab-astng | 0.19.0 | 0.19.0 | 0.24.0 |
| library/python-2/logilab-common | logilab-common | 0.40.0 | 0.40.0 | 0.58.2 |
| library/python-2/markdown | markdown | not included | not included | 2.3.1 |
| library/python-2/markupsafe | markupsafe | not included | not included | 0.18 |
| library/python-2/mock | mock | not included | not included | 1.0.1 |
| library/python-2/netifaces | netifaces | not included | not included | 0.8 |
| library/python-2/nose | nose | 1.1.2 | 1.1.2 | 1.2.1 |
| library/python-2/nose-cover3 | nose-cover3 | not included | not included | 0.0.4 |
| library/python-2/ordereddict | ordereddict | not included | not included | 1.1 |
| library/python-2/oslo.config | oslo.config | not included | not included | 1.2.1 |
| library/python-2/passlib | passlib | not included | not included | 1.6.1 |
| library/python-2/paste | paste | not included | not included | 1.7.5.1 |
| library/python-2/paste.deploy | paste.deploy | not included | not included | 1.5.0 |
| library/python-2/pbr | pbr | not included | not included | 0.5.21 |
| library/python-2/pep8 | pep8 | not included | not included | 1.4.4 |
| library/python-2/pip | pip | not included | not included | 1.4.1 |
| library/python-2/prettytable | prettytable | not included | not included | 0.7.2 |
| library/python-2/py | py | not included | not included | 1.4.15 |
| library/python-2/pyasn1 | pyasn1 | not included | not included | 0.1.7 |
| library/python-2/pyasn1-modules | pyasn1-modules | not included | not included | 0.0.5 |
| library/python-2/pycountry | pycountry | not included | not included | 0.17 |
| library/python-2/pydns | pydns | not included | not included | 2.3.6 |
| library/python-2/pyflakes | pyflakes | not included | not included | 0.7.2 |
| library/python-2/pygments | pygments | not included | not included | 1.6 |
| library/python-2/pyopenssl | pyopenssl | 0.11 | 0.11 | 0.13 |
| library/python-2/pyparsing | pyparsing | not included | not included | 2.0.1 |
| library/python-2/pyrabbit | pyrabbit | not included | not included | 1.0.1 |
| library/python-2/pytest | pytest | not included | not included | 2.3.5 |
| library/python-2/pytest-capturelog | pytest-capturelog | not included | not included | 0.7 |
| library/python-2/pytest-codecheckers | pytest-codecheckers | not included | not included | 0.2 |
| library/python-2/pytest-cov | pytest-cov | not included | not included | 1.6 |
| library/python-2/python-dbus-26 | D-Bus | 0.83.2 | 0.83.2 | 1.1.1 |
| library/python-2/python-imaging | python-imaging | not included | not included | 1.1.7 |
| library/python-2/python-ldap | python-ldap | not included | not included | 2.4.10 |
| library/python-2/python-mysql | python-mysql | not included | not included | 1.2.2 |
| library/python-2/python-zope-interface | Zope | not included | not included | 3.3.0 |
| library/python-2/pytz | pytz | not included | not included | 2013.4 |
| library/python-2/repoze.lru | repoze.lru | not included | not included | 0.6 |
| library/python-2/requests | requests | not included | not included | 1.2.3 |
| library/python-2/routes | routes | not included | not included | 1.13 |
| library/python-2/setuptools-git | setuptools-git | not included | not included | 1.0 |
| library/python-2/simplejson | simplejson | not included | not included | 2.1.2 |
| library/python-2/six | six | not included | not included | 1.4.1 |
| library/python-2/sqlalchemy | sqlalchemy | not included | not included | 0.7.9 |
| library/python-2/sqlalchemy-migrate | sqlalchemy-migrate | not included | not included | 0.7.2 |
| library/python-2/stevedore | stevedore | not included | not included | 0.10 |
| library/python-2/suds | suds | not included | not included | 0.4 |
| library/python-2/tempita | tempita | not included | not included | 0.5.1 |
| library/python-2/tox | tox | not included | not included | 1.4.3 |
| library/python-2/unittest2 | unittest2 | not included | not included | 0.5.1 |
| library/python-2/virtualenv | virtualenv | not included | not included | 1.9.1 |
| library/python-2/waitress | waitress | not included | not included | 0.8.5 |
| library/python-2/warlock | warlock | not included | not included | 1.0.1 |
| library/python-2/webob | webob | not included | not included | 1.2.3 |
| library/python-2/websockify | websockify | not included | not included | 0.3.0 |
| library/python-2/webtest | WebTest | not included | not included | 2.0.6 |
| library/python/cinderclient | OpenStack | not included | not included | 1.0.7 |
| library/python/glanceclient | OpenStack | not included | not included | 0.12.0 |
| library/python/keystoneclient | OpenStack | not included | not included | 0.4.1 |
| library/python/neutronclient | OpenStack | not included | not included | 2.3.1 |
| library/python/novaclient | OpenStack | not included | not included | 2.15.0 |
| library/python/quantumclient | OpenStack | not included | not included | 2.2.4.3 |
| library/python/swiftclient | OpenStack | not included | not included | 2.0.2 |
| library/security/libgpg-error | GnuPG | 1.10 | 1.12 | 1.12 |
| library/security/openssl | OpenSSL | 1.0.0.10 (1.0.0j) | 1.0.0.11 (1.0.0k) | 1.0.1.7 (1.0.1g) |
| library/security/openssl/openssl-fips-140 | OpenSSL | 1.2 | 1.2 | 2.0.6 |
| mail/fetchmail | fetchmail | 6.3.21 | 6.3.22 | 6.3.22 |
| mail/thunderbird | Mozilla Thunderbird | 10.0.6 | 17 | 17.0.6 |
| mail/thunderbird/plugin/thunderbird-lightning | Mozilla Lightning | 10.0.6 | 17 | 17.0.6 |
| media/cdrtools | CDrecord | 3.0 | 3.0 | 3.1 |
| network/amqp/rabbitmq | RabbitMQ | not included | not included | 3.1.3 |
| network/dns/bind | ISC BIND | 9.6.3.7.2 (9.6-ESV-R7-P2) | 9.6.3.10.2 (9.6-ESV-R10-P2) | 9.6.3.10.2 (9.6-ESV-R10-P2) |
| network/rsync | rsync | 3.0.8 | 3.0.8 | 3.0.9 |
| package/pkgbuild | pkgbuild | 1.3.104 | 1.3.104 | 1.3.105 |
| print/filter/hplip | HPLIP | 3.10.9 | 3.10.9 | 3.12.4 |
| runtime/clisp | GNU CLISP | 2.47 | 2.47 | 2.49 |
| runtime/erlang | Erlang | 12.2.5 | 12.2.5 | 15.2.3 |
| runtime/java/jre-6 | Java | 1.6.0.35 | 1.6.0.75 | 1.6.0.75 |
| runtime/java/jre-7 | Java | 1.7.0.7 | 1.7.0.55.13 | 1.7.0.55.13 |
| runtime/java/jre-8 | Java | not included | not included | 1.8.0.5.13 |
| runtime/perl-512 | Perl | 5.12.4 | 5.12.5 | 5.12.5 |
| runtime/ruby-18 | Ruby | 1.8.7.357 | 1.8.7.374 | 1.8.7.374 |
| runtime/ruby-19 | Ruby | not included | not included | 1.9.3.484 |
| runtime/ruby-19/ruby-tk | Ruby | not included | not included | 1.9.3.484 |
| runtime/tcl-8 | Tcl/Tk | 8.5.9 | 8.5.9 | 8.5.12 |
| runtime/tcl-8/tcl-sqlite-3 | | 3.7.11 | 3.7.14.1 | 3.7.14.1 |
| runtime/tk-8 | Tcl/Tk | 8.5.9 | 8.5.9 | 8.5.12 |
| security/compliance/openscap | OpenSCAP | 0.8.1 | 0.8.1 | 1.0.0 |
| security/sudo | Sudo | 1.8.4.5 | 1.8.6.7 | 1.8.6.7 |
| service/memcached | Memcached | 1.4.5 | 1.4.17 | 1.4.17 |
| service/network/dhcp/isc-dhcp | ISC DHCP | 4.1.0.6 | 4.1.0.7 | 4.1.0.7 |
| service/network/dns/bind | ISC BIND | 9.6.3.7.2 (9.6-ESV-R7-P2) | 9.6.3.10.2 (9.6-ESV-R10-P2) | 9.6.3.10.2 (9.6-ESV-R10-P2) |
| service/network/dnsmasq | Dnsmasq | not included | not included | 2.68 |
| service/network/ftp | ProFTPD | 1.3.3.0.7 (1.3.3g) | 1.3.4.0.3 (1.3.4c) | 1.3.4.0.3 (1.3.4c) |
| service/network/ntp | NTP | 4.2.5.200 (4.2.5p200) | 4.2.7.381 (4.2.7p381) | 4.2.7.381 (4.2.7p381) |
| service/network/ptp | PTPd | not included | not included | 2.2.0 |
| service/network/samba | Samba | 3.6.6 | 3.6.23 | 3.6.23 |
| service/network/smtp/sendmail | Sendmail | 8.14.5 | 8.14.7 | 8.14.7 |
| service/security/stunnel | stunnel | 4.29 | 4.29 | 4.56 |
| shell/gnu-getopt | GNU getopt | not included | not included | 1.1.5 |
| shell/parallel | GNU parallel | not included | not included | 0.2012.11.22 |
| shell/tcsh | tcsh | 6.17.0 | 6.18.1 | 6.18.1 |
| shell/zsh | Zsh | 4.3.17 | 4.3.17 | 5.0.5 |
| system/library/dbus | D-Bus | 1.2.28 | 1.2.28 | 1.7.1 |
| system/library/freetype-2 | FreeType | 2.4.9 | 2.4.11 | 2.4.11 |
| system/library/hmp-libs | HMP | not included | not included | 2.2.8 |
| system/library/libdbus | D-Bus | 1.2.28 | 1.2.28 | 1.7.1 |
| system/library/libdbus-glib | D-Bus | 0.88 | 0.88 | 0.100 |
| system/library/libpcap | tcpdump | 1.1.1 | 1.5.1 | 1.5.1 |
| system/library/security/libgcrypt | GNU libgcrypt | 1.4.5 | 1.5.3 | 1.5.3 |
| system/management/biosconfig | HMP | not included | not included | 2.2.8 |
| system/management/facter | Puppet | not included | not included | 1.6.18 |
| system/management/fwupdate | HMP | not included | not included | 2.2.8 |
| system/management/fwupdate/emulex | HMP | not included | not included | 6.3.12.2 |
| system/management/fwupdate/qlogic | HMP | not included | not included | 1.7.3 |
| system/management/hmp-snmp | HMP | not included | not included | 2.2.8 |
| system/management/hwmgmtcli | HMP | not included | not included | 2.2.8 |
| system/management/hwmgmtd | HMP | not included | not included | 2.2.8 |
| system/management/ipmitool | ipmitool | 1.8.11 | 1.8.11 | 1.8.12 |
| system/management/puppet | Puppet | not included | not included | 3.4.1 |
| system/management/raidconfig | HMP | not included | not included | 2.2.8 |
| system/management/ubiosconfig | HMP | not included | not included | 2.2.8 |
| system/storage/sg3_utils | sg3_utils | 1.28 | 1.28 | 1.33 |
| system/test/sunvts | | 7.0.14 | 7.17.1 | 7.18.0 |
| terminal/cssh | Cluster SSH | not included | not included | 4.2.1 |
| terminal/tmux | tmux | not included | not included | 1.8 |
| text/gnu-grep | GNU grep | 2.10 | 2.14 | 2.14 |
| text/texinfo | GNU texinfo | 4.7 | 4.13 | 4.13 |
| web/browser/firefox | Mozilla Firefox | 10.0.6 | 17 | 17.0.6 |
| web/java-servlet/tomcat | Apache Tomcat | 6.0.35 | 6.0.37 | 6.0.39 |
| web/php-53 | PHP | 5.3.14 | 5.3.27 | 5.3.28 |
| web/php-53/extension/php-zendopcache | Zend OPcache | not included | not included | 7.0.2 |
| web/proxy/squid | squid | 3.1.18 | 3.1.23 | 3.1.23 |
| web/server/apache-22 | Apache HTTPD | 2.2.22 | 2.2.25 | 2.2.27 |
| web/server/apache-22/module/apache-fcgid | Apache FastCGI | 2.3.6 | 2.3.9 | 2.3.9 |
| web/server/apache-22/module/apache-php53 | PHP | 5.3.14 | 5.3.27 | 5.3.28 |
| web/server/apache-22/module/apache-security | ModSecurity | 2.5.9 | 2.5.9 | 2.7.5 |
| web/server/apache-22/module/apache-sed | Apache HTTPD | 2.2.22 | 2.2.22 | 2.2.27 |
| web/server/lighttpd-14 | Lighttpd | 1.4.23 | 1.4.33 | 1.4.35 |
| web/wget | GNU wget | 1.12 | 1.12 | 1.14 |
| x11/data/xcursor-themes | X.Org | 1.0.3 | 1.0.3 | 1.0.4 |
| x11/demo/mesa-demos | Mesa 3-D | 8.0.1 | 8.0.1 | 8.1.0 |
| x11/diagnostic/intel-gpu-tools | X.Org | not included | not included | 1.3 |
| x11/diagnostic/xev | X.Org | 1.2.0 | 1.2.0 | 1.2.1 |
| x11/diagnostic/xscope | X.Org | 1.3.1 | 1.3.1 | 1.4 |
| x11/library/libdmx | X.Org | 1.1.2 | 1.1.2 | 1.1.3 |
| x11/library/libdrm | DRI | 2.4.32 | 2.4.32 | 2.4.43 |
| x11/library/libfontenc | X.Org | 1.1.1 | 1.1.1 | 1.1.2 |
| x11/library/libfs | X.Org | 1.0.4 | 1.0.4 | 1.0.5 |
| x11/library/libsm | X.Org | 1.2.1 | 1.2.1 | 1.2.2 |
| x11/library/libx11 | X.Org | 1.5.0 | 1.5.0 | 1.6.2 |
| x11/library/libxau | X.Org | 1.0.7 | 1.0.7 | 1.0.8 |
| x11/library/libxcb | XCB | 1.8.1 | 1.8.1 | 1.9.1 |
| x11/library/libxcomposite | X.Org | 0.4.3 | 0.4.3 | 0.4.4 |
| x11/library/libxcursor | X.Org | 1.1.13 | 1.1.13 | 1.1.14 |
| x11/library/libxdamage | X.Org | 1.1.3 | 1.1.3 | 1.1.4 |
| x11/library/libxext | X.Org | 1.3.1 | 1.3.1 | 1.3.2 |
| x11/library/libxfixes | X.Org | 5.0 | 5.0 | 5.0.1 |
| x11/library/libxfont | X.Org | 1.4.5 | 1.4.5 | 1.4.7 |
| x11/library/libxi | X.Org | 1.6.1 | 1.6.1 | 1.7.2 |
| x11/library/libxinerama | X.Org | 1.1.2 | 1.1.2 | 1.1.3 |
| x11/library/libxmu | X.Org | 1.1.1 | 1.1.1 | 1.1.2 |
| x11/library/libxmuu | X.Org | 1.1.1 | 1.1.1 | 1.1.2 |
| x11/library/libxp | X.Org | 1.0.1 | 1.0.1 | 1.0.2 |
| x11/library/libxpm | X.Org | 3.5.10 | 3.5.10 | 3.5.11 |
| x11/library/libxrandr | X.Org | 1.3.2 | 1.3.2 | 1.4.2 |
| x11/library/libxrender | X.Org | 0.9.7 | 0.9.7 | 0.9.8 |
| x11/library/libxres | X.Org | 1.0.6 | 1.0.6 | 1.0.7 |
| x11/library/libxtst | X.Org | 1.2.1 | 1.2.1 | 1.2.2 |
| x11/library/libxv | X.Org | 1.0.7 | 1.0.7 | 1.0.10 |
| x11/library/libxvmc | X.Org | 1.0.7 | 1.0.7 | 1.0.8 |
| x11/library/libxxf86vm | X.Org | 1.1.2 | 1.1.2 | 1.1.3 |
| x11/library/mesa | Mesa 3-D | 7.11.2 | 7.11.2 | 9.0.3 |
| x11/library/toolkit/libxaw7 | X.Org | 1.0.11 | 1.0.11 | 1.0.12 |
| x11/library/toolkit/libxt | X.Org | 1.1.3 | 1.1.3 | 1.1.4 |
| x11/library/xcb-util | XCB | 0.3.8 | 0.3.8 | 0.3.9 |
| x11/server/xorg | X.Org | 1.12.2 | 1.12.2 | 1.14.5 |
| x11/server/xorg/driver/xorg-input-keyboard | X.Org | 1.6.1 | 1.6.1 | 1.7.0 |
| x11/server/xorg/driver/xorg-input-mouse | X.Org | 1.7.2 | 1.7.2 | 1.9.0 |
| x11/server/xorg/driver/xorg-input-synaptics | X.Org | 1.6.2 | 1.6.2 | 1.7.1 |
| x11/server/xorg/driver/xorg-input-vmmouse | X.Org | 12.8.0 | 12.8.0 | 13.0.0 |
| x11/server/xorg/driver/xorg-video-ast | X.Org | 0.93.10 | 0.93.10 | 0.97.0 |
| x11/server/xorg/driver/xorg-video-ati | X.Org | 6.14.4 | 6.14.4 | 6.14.6 |
| x11/server/xorg/driver/xorg-video-cirrus | X.Org | 1.4.0 | 1.4.0 | 1.5.2 |
| x11/server/xorg/driver/xorg-video-dummy | X.Org | 0.3.5 | 0.3.5 | 0.3.6 |
| x11/server/xorg/driver/xorg-video-intel | X.Org | 2.18.0 | 2.18.0 | 2.21.5 |
| x11/server/xorg/driver/xorg-video-mach64 | X.Org | 6.9.1 | 6.9.1 | 6.9.4 |
| x11/server/xorg/driver/xorg-video-mga | X.Org | 1.5.0 | 1.5.0 | 1.6.2 |
| x11/server/xorg/driver/xorg-video-nv | X.Org | 2.1.18 | 2.1.18 | not included |
| x11/server/xorg/driver/xorg-video-openchrome | X.Org | 0.2.905 | 0.2.905 | 0.3.2 |
| x11/server/xorg/driver/xorg-video-r128 | X.Org | 6.8.2 | 6.8.2 | 6.8.4 |
| x11/server/xorg/driver/xorg-video-trident | X.Org | 1.3.5 | 1.3.5 | not included |
| x11/server/xorg/driver/xorg-video-vesa | X.Org | 2.3.1 | 2.3.1 | 2.3.2 |
| x11/server/xorg/driver/xorg-video-vmware | X.Org | 12.0.2 | 12.0.2 | 13.0.1 |
| x11/session/sessreg | X.Org | 1.0.7 | 1.0.7 | 1.0.8 |
| x11/session/xinit | X.Org | 1.3.2 | 1.3.2 | 1.3.3 |
| x11/transset | X.Org | 1.0.0 | 1.0.0 | 1.0.1 |
| x11/x11-window-dump | X.Org | 1.0.5 | 1.0.5 | 1.0.6 |
| x11/xcalc | X.Org | 1.0.4.1 | 1.0.4.1 | 1.0.5 |
| x11/xclipboard | X.Org | 1.1.2 | 1.1.2 | 1.1.3 |
| x11/xclock | X.Org | 1.0.6 | 1.0.6 | 1.0.7 |
| x11/xconsole | X.Org | 1.0.4 | 1.0.4 | 1.0.6 |
| x11/xfd | X.Org | 1.1.1 | 1.1.1 | 1.1.2 |
| x11/xfontsel | X.Org | 1.0.4 | 1.0.4 | 1.0.5 |
| x11/xfs | X.Org | 1.1.2 | 1.1.3 | 1.1.3 |
| x11/xkill | X.Org | 1.0.3 | 1.0.3 | 1.0.4 |
| x11/xmag | X.Org | 1.0.4 | 1.0.4 | 1.0.5 |
| x11/xman | X.Org | 1.1.2 | 1.1.2 | 1.1.3 |
| x11/xvidtune | X.Org | 1.0.2 | 1.0.2 | 1.0.3 |

 April 27, 2014

## Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and then has some core X server performance improvements as well for filled and outlined arcs.

### Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn’t quite measuring what we thought. I’m not interested in competing on absolute x11perf numbers; I’m only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap. And, it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To exercise the rectangle code and check edge conditions better, we added a one pixel gap between the rectangles. However, we didn’t reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.
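
The 36-versus-25 arithmetic can be checked with a few lines of Python. This is only a sketch under an assumed layout model (square rectangles on a regular grid, counted per side); it is not taken from the x11perf source, and the names are made up for illustration.

```python
# Assumed layout model (not taken from the x11perf source): square
# rectangles placed on a regular grid inside a square window.
WINDOW = 600      # window width and height in pixels
RECT = 100        # rectangle width and height in pixels
REQUESTED = 36    # rectangles drawn per test iteration (6x6)


def rects_per_side(window, rect, gap):
    """How many rectangles fit along one side with `gap` pixels between them."""
    # n rectangles need n*rect + (n-1)*gap pixels.
    return (window + gap) // (rect + gap)


tight = rects_per_side(WINDOW, RECT, gap=0) ** 2    # original layout
gapped = rects_per_side(WINDOW, RECT, gap=1) ** 2   # after adding the gap
redundant = REQUESTED - gapped

print(tight, gapped, redundant)
```

With a one pixel gap, six rectangles would need 605 pixels per side, so only five fit, which is where the 25 distinct rectangles and the 11 redundant ones come from.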

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result.

EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then take the rectangles and clip them against the window clip list by computing a region from them and intersecting that with the GC composite clip. It’s a completely reasonable plan; however, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle. Which is surprisingly fast, compared with drawing individual lines.

I “fixed” the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.
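
To see why tightly packed lines collapse into one rectangle while spaced lines do not, here is a toy Python sketch. It is a deliberately simplified stand-in for a region computation (it only merges exactly adjacent rectangles with the same vertical extent) and is not the code used by EXA, UXA, SNA or the X server; the function names are hypothetical.

```python
def lines_to_rects(xs, y0, y1):
    """Turn 1-pixel-wide vertical lines at the given x positions into
    rectangles (x0, y0, x1, y1), with exclusive right/bottom edges."""
    return [(x, y0, x + 1, y1) for x in xs]


def coalesce(rects):
    """Merge horizontally adjacent rectangles that share the same vertical
    extent. A toy stand-in for a real region computation."""
    merged = []
    for x0, y0, x1, y1 in sorted(rects):
        if merged and merged[-1][1] == y0 and merged[-1][3] == y1 and merged[-1][2] == x0:
            px0, py0, _, py1 = merged[-1]
            merged[-1] = (px0, py0, x1, py1)
        else:
            merged.append((x0, y0, x1, y1))
    return merged


packed = coalesce(lines_to_rects(range(0, 100), 0, 100))      # lines touching
spaced = coalesce(lines_to_rects(range(0, 200, 2), 0, 100))   # 1-pixel gaps

print(len(packed), len(spaced))
```

In this model the packed case ends up as a single 100x100 rectangle, while the spaced case keeps all 100 separate one-pixel-wide rectangles, so the test has to do per-line work again.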

I’ve pushed out these changes to my x11perf repository on freedesktop.org.

## What’s Fast

Things that match GL’s capabilities are fast, things which don’t are slow. No surprises there. What’s interesting is precisely what matches GL.

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

### GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.
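
The cap style trick can be illustrated with a small Python model. This is a sketch, not Glamor's code: `gl_pixels` is a toy rasterizer that only handles horizontal, vertical and 45-degree segments, and the choice to extend the extra segment by one pixel in +x is an assumption made for the example.

```python
def gl_pixels(segment):
    """Toy GL-like rasterizer for horizontal, vertical or 45-degree
    segments: covers the start pixel but omits the final one."""
    (x0, y0), (x1, y1) = segment
    if not (x0 == x1 or y0 == y1 or abs(x1 - x0) == abs(y1 - y0)):
        raise ValueError("toy rasterizer only handles axis-aligned or 45-degree lines")
    dx = (x1 > x0) - (x1 < x0)
    dy = (y1 > y0) - (y1 < y0)
    pixels = []
    x, y = x0, y0
    while (x, y) != (x1, y1):
        pixels.append((x, y))
        x, y = x + dx, y + dy
    return pixels


def x_polyline_segments(points, cap_not_last=False):
    """Segments to hand to the GL-like rasterizer for an X polyline."""
    segments = list(zip(points, points[1:]))
    if segments and not cap_not_last:
        last_x, last_y = points[-1]
        # Extra one-pixel segment so the final pixel gets drawn
        # (stepping +1 in x is an assumption of this sketch).
        segments.append(((last_x, last_y), (last_x + 1, last_y)))
    return segments


def draw(points, cap_not_last=False):
    """Pixels covered when the polyline is drawn through the toy rasterizer."""
    return [p for s in x_polyline_segments(points, cap_not_last) for p in gl_pixels(s)]


path = [(0, 0), (3, 0), (3, 2)]
print(draw(path))
print(draw(path, cap_not_last=True))
```

The joints between segments need no special handling in this model, because each segment's omitted end pixel is the next segment's start pixel; only the very last pixel of the joined set needs the extra segment.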

The other implicit requirement is that all zero width lines look the same. Right now, I’ve solved that for fill styles and raster ops as they’re all drawn with the same GL operations. However, for plane masks, we’re currently falling back to software, which may draw something different. Fixing that isn’t impossible, it’s just tedious.

### Text

Pushing all of the work of drawing core text into glamor wasn’t terribly difficult, and the results are pretty spectacular.

## What’s Slow

We’ve still got room for improvement in Glamor, but there aren’t any obvious show-stoppers to getting great performance for reasonable X applications anymore.

### Wide Lines and Arcs

One of the speed-ups I’ve made in my glamor branch is to merge all of the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out to both improve performance and save a bit of code. Right now, drawing wide lines and wide arcs doesn’t do this, and so we suffer from submitting many smaller requests to GL. It’s hard to get excited about speeding any of this up as all of the wide primitives are essentially unused these days.

### Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can’t really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren’t that useful.

### Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We’ve got a GSoC student ready to go at this though, so I expect we’ll have much better numbers in a few months.

### Window Operations

You wouldn’t think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

## Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you’d expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blts of any form. There’s the NV_texture_barrier extension which at least lets us do blts within the same object, but even that doesn’t define how overlapping blts work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.
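
As a CPU-side analogy for why an overlapping copy within one object needs a temporary, here is a short Python sketch. It is not GL code and does not model what any GPU actually does with overlapping blts (which is undefined); it only shows one way an in-place copy can go wrong and how a snapshot avoids it.

```python
def copy_in_place(buf, src, dst, n):
    """Element-by-element forward copy inside one buffer; reads may see
    values that this same copy already overwrote."""
    for i in range(n):
        buf[dst + i] = buf[src + i]


def copy_via_temp(buf, src, dst, n):
    """Snapshot the source first, then write it to the destination."""
    temp = buf[src:src + n]
    buf[dst:dst + n] = temp


a = list(range(8))
b = list(range(8))
copy_in_place(a, src=0, dst=2, n=6)
copy_via_temp(b, src=0, dst=2, n=6)
print(a)
print(b)
```

The in-place version smears the first two elements across the buffer, while the version with the temporary produces the shifted data a scroll is supposed to produce, at the cost of one extra copy.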

## Results

Here’s a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.

 April 25, 2014

Today I released AppStream and libappstream 0.6.1, which feature mostly bugfixes, so nothing incredibly exciting to see there (but this also means no API/ABI breaks). The release clarifies some paragraphs in the spec which people found confusing, and fixes a few issues (like one example in the docs not being valid AppStream metadata). As the only spec extension, we introduce a “priority” property in distro metadata to allow metadata from one repository to override data shipped by another one. This is used (although with a similar syntax) in Fedora already to have “placeholder” data for non-free stuff, which gets overridden by the real metadata if a new application was added. In general, the property tag was added to make the answer to the question “which data is preferred” much less magic.
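
The override behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not libappstream code: the data layout, the `merge_metadata` helper and the component id are hypothetical, and the sketch assumes that a numerically higher priority is the preferred one.

```python
def merge_metadata(sources):
    """sources: iterable of (priority, {component_id: data}) pairs.
    Returns one dict where, for each component id, the data from the
    highest-priority source wins (assumption: higher number is preferred)."""
    chosen = {}
    for priority, components in sources:
        for cid, data in components.items():
            if cid not in chosen or priority > chosen[cid][0]:
                chosen[cid] = (priority, data)
    return {cid: data for cid, (_, data) in chosen.items()}


# Hypothetical repositories: a placeholder entry and the real metadata.
placeholder_repo = (0, {"example-app.desktop": {"summary": "Placeholder entry"}})
real_repo = (10, {"example-app.desktop": {"summary": "Real application metadata"}})

merged = merge_metadata([placeholder_repo, real_repo])
print(merged["example-app.desktop"]["summary"])
```

The point of an explicit priority is that the result no longer depends on the order in which repositories happen to be read.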

The libappstream library got some new API to query component data in different ways, and I also brought back support for Vala (so if you missed the Vapi file: It’s back now, although you have to manually enable this feature).

The CLI tool also got some extensions to query AppStream data. Here is a brief introduction:

First of all, we need to make sure the database is up-to-date, which should be the case already (it is rebuilt automatically):

    $ sudo appstream-index refresh

The database will only be rebuilt when necessary; if you want to force a rebuild anyway, use the “--force” parameter.

Now imagine we want to search for an app containing the word “media” (in description, keywords, summary, …):

    $ appstream-index s media

which will return:

    Identifier: gnome-media-player.desktop [desktop-app]
    Name: GNOME Media Player
    Summary: A simple media player for GNOME
    Package: gnome-media-player
    ----
    Identifier: mediaplayer-app.desktop [desktop-app]
    Name: Media Player
    Summary: Media Player
    Package: mediaplayer-app
    ----
    Identifier: kde4__plasma-mediacenter.desktop [desktop-app]
    Name: Plasma Media Center
    Summary: A mediacenter user interface written with the Plasma framework
    Package: plasma-mediacenter
    ----
    etc.

If we already know the name of a .desktop file or the ID of a component, we can have the tool print out information about the application, including which package it was installed from:

    $ appstream-index get lyx.desktop

If we want to see more details, including e.g. a screenshot URL and a longer description, we can pass “--details” to the tool:

    Identifier: lyx.desktop [desktop-app]
    Name: LyX
    Summary: An advanced document processor with the power of LaTeX.
    Package: lyx-common
    Homepage: http://www.lyx.org/
    Icon: lyx.png
    Description: LyX is a document processor that encourages an approach to writing based on the structure of your documents (WYSIWYM) and not simply their appearance (WYSIWYG). LyX combines the power and flexibility of TeX/LaTeX[...]
    Sample Screenshot URL: http://alt.fedoraproject.org/pub/alt/screenshots/f21/source/lyx-ea535ddf18b5c7328c5e88d2cd2cbd8c.png
    License: GPLv2+

(I truncated the results slightly.)

Okay, so far so good. But now it gets really exciting (and this is a feature added with 0.6.1): we can now query a component by the items it provides. For example, I want to know which software provides the library libfoo.so.2:

    $ appstream-index what-provides lib libfoo.so.2

This also works with binaries, or Python modules:

    $ appstream-index what-provides bin apper

This works in a distribution-agnostic way, as long as software ships upstream metadata with a valid <provides/> field, or the distributor adds it while generating AppStream distro metadata.

This means that software can, as soon as we have sufficient metadata of this kind, declare its dependencies upstream in the form of a simple text file, referencing the components needed to build and run it on any Linux distribution. Users can simply install missing stuff by passing that file to their package manager, which can look up the components->packaging mapping and versions and do the right thing in installing the dependencies. So basically, this allows the things “pip install -r” does for Python, but for any application (not only Python stuff), and based on the distributor's package database.

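
To make the dependency-file idea a bit more concrete, here is a minimal Python sketch of the lookup step. Everything in it is hypothetical: the `<kind> <item>` file format, the `PROVIDES` index and the package names are invented for illustration and are not part of AppStream or any package manager.

```python
# Hypothetical provides index: (kind, item) -> package name. Real data would
# come from AppStream metadata; these entries are made up for illustration.
PROVIDES = {
    ("lib", "libfoo.so.2"): "libfoo2",
    ("bin", "apper"): "apper",
}


def parse_requirements(text):
    """Parse lines of the form '<kind> <item>', skipping blanks and # comments."""
    wanted = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, item = line.split(None, 1)
        wanted.append((kind, item))
    return wanted


def resolve(wanted, index):
    """Split requested items into packages to install and unresolved items."""
    packages, missing = [], []
    for key in wanted:
        if key in index:
            packages.append(index[key])
        else:
            missing.append(key)
    return packages, missing


requirements = """
# hypothetical dependency file
lib libfoo.so.2
bin apper
lib libbar.so.1
"""
packages, missing = resolve(parse_requirements(requirements), PROVIDES)
print(packages, missing)
```

A real implementation would also have to deal with versions and with several packages providing the same item, which this sketch ignores.
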
With the provides-items, we can also scan software to detect its dependencies automatically (and have them in a distro-agnostic form directly). We can also easily search for missing mimetype handlers, missing kernel modules, missing firmware etc. to install them on demand, making the system much smarter in handling its dependencies. And users don’t need to do search orgies to find the right component for a given task.

Also on my todo list for the future, based on this feature: A small tool telling upstream authors which distribution has their application in which version, using just one command (and AppStream data from multiple distros). Also planned: A cross-distro information page showing which distros ship which library versions, Python modules and application versions (and also the support status of the distro), so developers know which library versions (or GCC versions etc.) they should at least support to make their application easily available on most distributions.

As always, you can get the releases on Freedesktop, as well as the AppStream specification.

 April 17, 2014
Under that name is a simple idea: making it easier to save, load, update and query objects in an object store.

I'm not the main developer for this piece of code, but contributed a large number of fixes to it, while porting a piece of code to it as a test of the API. Much of the credit for the design of this very useful library goes to Christian Hergert.

The problem

It's possible that you've already implemented a data store inside your application, hiding your complicated SQL queries in a separate file because they contain injection security issues. Or you've used the filesystem as the store and thrown away the ability to search particular fields without loading everything in memory first.
Given that SQLite pretty much matches our use case (it offers good search performance, it's a popular and thus well-documented project, and its files can be manipulated through a number of first-party and third-party tools), wrapping its API to make it easier to use is probably the right solution.

The GOM solution

GOM is a GObject-based wrapper around SQLite. It will hide SQL from you, but still allow you to call into it if you have a specific query you want to run. It will also make sure that SQLite queries don't block your main thread, which is pretty useful indeed for UI applications.

For each table, you would have a GObject, a subclass of GomResource, representing a row in that table. Each column is a property on the object. To add a new item to the table, you would simply do:

    item = g_object_new (ITEM_TYPE_RESOURCE, "column1", value1, "column2", value2, NULL);
    gom_resource_save_sync (item, NULL);

We have a number of features which try to make it as easy as possible for application developers to use gom, such as:

• Automatic table creation for string, string arrays, and number types as well as GDateTime, and transformation support for complex types (say, colours or images).
• Automatic database version migration, using annotations on the properties ("new in version")
• Programmatic API for queries, including deferred fetches for results

Currently, what mostly limits the net gain in terms of lines of code, when porting from SQLite, is the verbosity of declaring properties with GObject. That will hopefully be fixed by the GProperty work planned for the next GLib release.

The future

I'm currently working on some missing features to support a port of the grilo bookmarks plugin (support for column REFERENCES).

I will also be making (small) changes to the API to allow changing the backend from SQLite to another one, such as XML, or a binary format. Obviously the SQL "escape hatches" wouldn't be available with those backends.
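
To make the "one object per row, one property per column" idea from the GOM section more concrete, here is a rough analogy in Python using the standard sqlite3 module. It is only a sketch under stated assumptions: the Item class and the save() helper are hypothetical names, and none of this is GOM's API, which is C and GObject based.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Item:
    """Hypothetical resource: each field stands for one column."""
    column1: str
    column2: int


def save(conn, obj):
    """Create the table if needed and insert the object as one row."""
    names = [f.name for f in fields(obj)]
    table = type(obj).__name__.lower()
    cols = ", ".join(names)
    marks = ", ".join("?" for _ in names)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    # Values are bound as parameters, never pasted into the SQL string.
    conn.execute(
        f"INSERT INTO {table} ({cols}) VALUES ({marks})",
        [getattr(obj, name) for name in names],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save(conn, Item(column1="hello", column2=42))
rows = conn.execute("SELECT column1, column2 FROM item").fetchall()
```

The caller only deals with objects and properties; the SQL, including the parameter binding that avoids injection problems, stays inside the helper.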
Don't hesitate to file bugs if there are any problems with the API, or its documentation, especially with respect to porting from applications already using SQLite directly. Or if there are bugs (surely, no).

Note that JavaScript support isn't ready yet, due to limitations in gjs.

¹: « SQLite don't hurt me, don't hurt me, no more »

 April 16, 2014
Things are moving forward for the Fedora Workstation project. For those of you who don’t know about it, it is part of a broader plan to refocus Fedora around 3 core products with a clear and distinctive use case for each. The goal here is to be able to have a clear definition of what Fedora is and have something that, for instance, ISVs can clearly identify and target with their products. At the same time it is trying to move away from the traditional distribution model, a model where you primarily take whatever comes your way from upstream, apply a little duct tape to try to keep things together and ship it. That model was good in the early years of Linux's existence, but it does not seem a great fit for what people want from an operating system today.

If we look at successful products such as Mac OS X, PlayStation 4, Android and ChromeOS, the red thread between them is that while they were all built on top of existing open source efforts, they didn’t just indiscriminately shovel in any open source code and project they could find. Instead they decided upon the product they wanted to make and then cherry-picked the pieces out there that could help them with that, developing themselves what they couldn’t find perfect fits for. The same is to some degree true for things like Red Hat Enterprise Linux and Ubuntu. Both products, while based almost solely on existing open source components, have cherry-picked what they wanted and then developed the pieces they needed on top of them. For instance, for Red Hat Enterprise Linux the custom kernel has always been part of the value add offered: a Linux kernel with a core set of dependable APIs.
Fedora on the other hand has historically followed a path more akin to Debian, with a ‘the more the merrier’ attitude, trying to welcome anything into the group. A metaphor often used in the Fedora community to describe this state was that Fedora was like a collection of Lego blocks. So if you had the time and the interest you could build almost anything with it. The problem with this state was that the products you built also ended up feeling like the creations you make with a random box of Lego blocks: a lot of pointy edges and some weird-looking sections, due to needing to solve some of the issues with the pieces you had available as opposed to the piece most suited.

With the 3 products we are switching to a model where, although we start with that big box of Lego blocks, we add some engineering capacity on top of it, make some clear and hard decisions on direction, and actually start creating something that looks and feels like it was made to be a whole instead of just assembled from a random set of pieces.

So when we are planning the Fedora Workstation we are not just looking at what features we can develop for individual libraries or applications like GTK+, Firefox or LibreOffice, but we are looking at what we want the system as a whole to look like. And maybe most important, we try our hardest to look at things from a feature/use case viewpoint first as opposed to a specific technology viewpoint. So instead of our question being ‘what features are there in systemd that we can expose/use in the desktop’, the question becomes ‘what new features do we want to offer our users in future versions of the product, and what do we need from systemd, the kernel and others to be able to do that’. So while technologies such as systemd, Wayland, docker and btrfs are on our roadmap, they are not there because they are ‘cool technologies’; they are there because they provide us with the infrastructure we need to achieve our feature goals.
And what's more, we make sure to work closely with the core developers to make the technologies what we need them to be. This means for example that between myself and other members of the team we are having regular conversations with people such as Kristian Høgsberg and Lennart Poettering, and of course contributing code where possible.

To explain our mindset with the Fedora Workstation effort let me quickly summarize some old history. In 2001 Jim Gettys, one of the original creators of the X Window System, gave a talk at GUADEC in Seville called ‘Draining the Swamp’. I don’t think the talk can be found online anywhere, but he outlined some of the same thoughts in this email reply to Richard Stallman some time later. I think that presentation has shaped the thinking of the people who saw it ever since; I know it has shaped mine. Jim’s core message was that the idea that we can create a great desktop system by trying to work around the shortcomings or weirdness in the rest of the operating system was a total fallacy. If we look at the operating system as a collection of 100% independent parts, all developing at their own pace and with their own agendas, we will never be able to create a truly great user experience on the desktop. Instead we need to work across the stack, fixing the issues we see where they should be fixed, and through that ‘drain the swamp’. Because if we continued to try to solve the problems by adding layers upon layers of workarounds and abstraction layers we would instead be growing the swamp, making it even more unmanageable. We are trying to bring that ‘draining the swamp’ mindset with us into creating the Fedora Workstation product.

With that in mind, what are the driving ideas behind the Fedora Workstation? The Fedora Workstation effort is meant to provide a first-class desktop for your laptop or workstation computer, combining a polished user interface with access to new technologies.
We are putting a special emphasis on developers with our first releases, both looking at how we improve the desktop experience for developers, and looking at what tools we can offer to developers to let them be productive as quickly as possible. And to be clear, when we say developers we are not only thinking about developers who want to develop for the desktop or the desktop itself, but any kind of software developer or DevOps out there.

The full description of the Fedora Workstation can be found here, but the essence of our plan is to create a desktop system that not only provides some incremental improvements over how things are done today, but which truly tries to take a fresh look at how a Linux desktop operating system should operate. The traditional distribution model, built up around software packages like RPM or Deb, has both its pluses and minuses. Its biggest challenge is probably that it creates a series of fiefdoms where 3rd party developers can’t easily target the system or a family of systems except through spending time very specifically supporting each one. And even once a developer decides to commit to trying to support a given system, it is not clear what system services they can depend on always being available or what human interface design they should aim for. Solving these kinds of issues is part of our agenda for the new workstation.

So to achieve this we have decided on a set of core technologies to build this solution upon. The central piece of the puzzle is the so-called LinuxApps proposal from Lennart Poettering. LinuxApps is currently a combination of high-level ideas and some concrete building blocks. The building blocks are technologies such as Wayland, kdbus, overlayfs and software containers.
The ideas side includes developing a permission system, similar to what you for instance see Android applications employ, to decide what rights a given application has, and developing defined, versioned library bundles that 3rd party applications can depend on regardless of the version of the operating system. On the container side we plan on expanding on the work Red Hat is doing with Docker and Project Atomic.

In terms of some of the other building blocks, I think most of you already know of the big push we are doing to get the new Wayland display server ready. This includes work on developing core infrastructure like libinput, a new library for handling input devices being developed by Jonas Ådahl and our own Peter Hutterer. There is also a lot of work happening on the GNOME 3 side of things to make GNOME 3 Wayland-ready. Jasper St. Pierre wrote up a great blog entry outlining his work to make GDM and the GNOME Shell work better with Wayland. It is an ongoing effort, but there is a big community around this effort, as most recently seen at the West Coast Hackfest at the Endless Mobile office.

As I mentioned, there is a special emphasis on developers for the initial releases. This includes a set of both small and big changes. For instance, we decided to put some time into improving the GNOME terminal application, as we know it is a crucial piece of technology for a lot of developers and system administrators alike. Some of the terminal improvements can be seen in GNOME 3.12, but we have more features lined up for the terminal, including the return of translucency. But we are also looking at the tools provided in general, and the great thing here is that we are able to build upon a lot of efforts that Red Hat is developing for the Red Hat product portfolio, like Software Collections, which gives easy access to a wide range of development tools and environments. Together with Developer Assistant this should greatly enhance your developer experience in the Fedora Workstation.
The inclusion of Software Collections also means that Fedora becomes an even better tool than before for developing software that you expect to deploy on RHEL. Software Collections ensure that you can have the exact same toolchain and toolchain versions available on both systems, so you can be sure that the collection you have been developing against on Fedora will be available in identical form on RHEL.

Of course, creating a great operating system isn’t just about the applications and shell, but also about supporting the kind of hardware people want to use. A good example here is that we put a lot of effort into HiDPI support. HiDPI screens are not very common yet, but a lot of the new high-end laptops coming out are using them already. Anyone who has used something like a Google Pixel or a Samsung Ativ Book 9 Plus has quickly come to appreciate the improved sharpness and image quality these displays bring. Due to the effort we put in there, I have been very pleased to see many GNOME 3.12 reviews mentioning this work recently and saying that GNOME 3.12 is currently the best Linux desktop for use with HiDPI systems because of it.

Another part of the puzzle for creating a better operating system is software installation. The traditional distribution model often tended to try to bundle as many applications as possible, as there was no good way for users to discover new software for their system. This is a brute-force approach that assumes that if you checked the ‘scientific researcher’ checkbox you want to install a random collection of 100 applications useful for ‘scientific researchers’. To me this is a symptom of a system that does not provide a good way of finding and installing new applications. Thanks to the ardent efforts of Richard Hughes we have a new Software Installer that keeps going from strength to strength.
It was originally launched in Fedora 19, but as we move forward towards the first Fedora Workstation release we are enabling new features and adding polish to it. One area where we need the wider Fedora community to work with us is increasing the coverage of appdata files. Appdata files essentially contain the necessary metadata for the installer to describe and advertise the application in question, including descriptive text and screenshots. Ideally upstreams should come with their own appdata file, but in the cases where they do not, we should add them to the Fedora package directly. Currently applications from the GTK+ and GNOME sphere have relatively decent appdata coverage, but we need more effort put into getting applications using other toolkits covered too.

Which brings me to another item of importance to the workstation. The Linux community has for natural reasons been very technical in nature, which has meant that some things that on other operating systems are not even a question have become defining traits on Linux. The choice of GUI development toolkit is one of these. It has been a great tool used by the open source community to shoot ourselves in the foot for many years now. So while users of Windows or Mac OS X probably never ask themselves what toolkit was used to implement a given application, it seems to be a frequently asked question for Linux applications. We want to move away from that with the Workstation. So while we do ship the GNOME Shell as our interface and use GTK+ for developing tools ourselves, including spending time evolving the toolkit itself, that does not mean we think applications written using for instance Qt, EFL or Java are evil and should be exorcised from the system. In fact, if an application developer wants to write an application for the Linux desktop at all, we greatly appreciate that effort regardless of what tools they decide to use to do so.
The choice of development toolkits is a choice meant to empower developers, not create meaningless distinctions for the end user. So one effort we have underway is to work on the necessary theming and other glue code to make sure that if you run a Qt application under the GNOME Shell it feels like it belongs there, which also extends to accessibility-related setups like the high contrast theme. We hope to expand upon that effort both in width and in depth going forward.

And maybe on a somewhat related note, we are also trying to address the elephant in the room when it comes to the desktop, and that is the fact that the importance of the traditional desktop is decreasing in favor of the web. A lot of things that you used to do locally on your computer you are probably just doing online these days. And a lot of the new things you have started doing on your computer or other internet-capable device are actually web services as opposed to local applications. The old Sun slogan of ‘The Network is the Computer’ is more true today than it has ever been before. We don’t believe the desktop is dead in any way or form, as some of the hipsters in the media like to claim; in fact we expect it to stay around for a long time. What we do envision though is that the amount of time you spend on webapps will continue to grow and that more and more of your computing tasks will be done using web services as opposed to local applications. Which is why we are continuing to deeply integrate the web into your desktop, be that through things like GNOME Online Accounts or the new Webapps that are introduced in the Software installer. And as I have mentioned before on this blog, we are also still working on trying to improve the integration of Chrome and Firefox apps into the desktop along the same lines.
So while we want the desktop to help you use the applications you want to run locally as efficiently as possible, we also realize that you, like us, are living in a connected world, and thus we need to help give you easy access to your online life to stay relevant.

There are of course a lot of other parts of the Fedora Workstation effort, but this has already turned into a very long blog post as it is, so I will leave the rest for later. Please feel free to post any questions or comments and I will try to respond.

 April 14, 2014
The 2014 "Journées du Logiciel Libre" took place in Lyon like (almost) every year this past weekend. It's a francophone free software event over 2 days with talks, and plenty of exhibitors from local Free Software organisations. I made the 600-metre trip to the venue, and helped man the GNOME booth with Frédéric Peters and Alexandre Franke's moustache.

Our demo computer was running GNOME 3.12, using Fedora 20 plus the GNOME 3.12 COPR repository, which was working pretty well, bar some teething problems.

We kept the great GNOME 3.12 video running in Videos, showcasing the video websites integration, and regularly demo'd new applications to passers-by.

The majority of people we talked to were pretty impressed by the path GNOME has taken since GNOME 3.0 was released: the common design patterns across applications, the iterative nature of the various UI elements, the hardware integration or even the online services integration.

The stand-out changes for users were the Maps application which, though a bit bare bones still, impressed users, and the redesigned Videos.

We also spent time with a couple of users dispelling myths about the "lightness" of certain desktop environments or the "heaviness" of GNOME. We're constantly working on reducing resource usage in GNOME, be it sluggishness due to the way certain components work (with the applications binary cache), memory usage (cf. the recent gjs improvements), or battery usage (cf.
my wake-up reduction posts). Running gnome-shell on desktop machines with tablet-grade hardware shows that we can offer a good user experience on hardware that's not top-of-the-line.

Our booth was opposite the ones from our good friends from Ubuntu and Fedora, and we routinely pointed to either of those booths for people who were interested in running the latest GNOME 3.12, whether using the Fedora COPR repository or Ubuntu GNOME.

We found a couple of bugs during demos, and promptly filed them in Bugzilla, or fixed them directly. In the future, we might want to run a stable branch version of GNOME Continuous to get fixes for embarrassing bugs quickly (such as a crash when enabling Zoom in gnome-shell, which made an accessibility enthusiast tut at us).

GNOME and Rhône

Until next year in sunny Lyon.

(and thanks Alexandre for the photos in this article!)

 April 06, 2014
When I started to write the landing page for The Hacker's Guide to Python, I wanted to try new things at the same time. I read about A/B testing a while ago, and I figured it was a good opportunity to test it out.

# A/B testing

If you do not know what A/B testing is about, take a quick look at the Wikipedia page on that subject. Long story short, the idea is to serve two different versions of a page to your visitors and check which one is getting the most success. Once you have found which version is better, you can switch to it definitively.

In the case of my book, I used that technique on the pre-launch page where people were able to subscribe to the newsletter. I didn't have a lot of things I wanted to test out on that page, so I just used that approach on the subtitle, being either "Learn everything you need to build a successful Python project" or "It's time you make the most out of Python".

Statistically, each version would be served half of the time, so both would get the same number of views. I would then build statistics about which page was attracting the most subscribers.
With the results I would be able to switch definitively to that version of the landing page.

# Technical design

My Web site, this Web site, is entirely static and served by Apache httpd. I didn't want to use any dynamic page, language or whatever, mainly because I didn't want to have something else to install and maintain just for that on my server.

It turns out that Apache httpd is powerful enough to implement such a feature. There are different ways to build it, and I'm going to describe my choices here.

The first thing to pick is a way to balance the display of the page. You need to find a way so that if you get 100 visitors, around 50 will see version A of your page, and around 50 will see version B.

You could use a random number generator, pick a random number for each visitor, and decide which page they are going to see. But I didn't find a way to do that with Apache httpd at first sight.

My second thought was to use the client IP address. But it's not such a good idea, because if you get visitors from, for example, behind a company firewall, they are all going to be served the same page, so that kind of kills the statistics.

Finally, I picked time-based balancing: if you visit the page on a second that is even, you get version A of the page, and if you visit the page on a second that is odd, you get version B. Simple, and so far nothing proves there are more visitors on even than odd seconds, or vice-versa.

The next thing is to always serve the same page to a returning visitor. I mean that if the visitor comes back later and gets a different version, that's cheating. I decided the system should always serve the same page once a visitor "picked" a version. To do that, it's simple enough: you just have to use cookies to store the page the visitor has been attributed, and then use that cookie if they come back.

# Implementation

To do that in Apache httpd, I used the powerful mod_rewrite that is shipped with it.
I put 2 files in the books directory, named "the-hacker-guide-to-python-a.html" and "the-hacker-guide-to-python-b.html", that got served when you requested "/books/the-hacker-guide-to-python".

    RewriteEngine On
    RewriteBase /books

    # If there's a cookie called thgtp-pre-version set,
    # use its value and serve the page
    RewriteCond %{HTTP_COOKIE} thgtp-pre-version=([^;]+)
    RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-%1.html [L]

    # No cookie yet and…
    RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
    # … the number of seconds of the time right now is even
    RewriteCond %{TIME_SEC} [02468]$
    # Then serve the page A and store "a" in a cookie
    RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-a.html [cookie=thgtp-pre-version:a:julien.danjou.info,L]

    # No cookie yet and…
    RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
    # … the number of seconds of the time right now is odd
    RewriteCond %{TIME_SEC} [13579]$
    # Then serve the page B and store "b" in a cookie
    RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-b.html [cookie=thgtp-pre-version:b:julien.danjou.info,L]

With those few lines, it worked flawlessly.
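
The decision made by the rewrite rules can be restated as a small function, which is handy for reasoning about it or testing it offline. This is only a Python sketch of the same logic, not something Apache runs; the function name pick_version and its return shape are assumptions made for illustration.

```python
import re

# Same pattern as in the RewriteCond lines: capture the stored version.
COOKIE_RE = re.compile(r"thgtp-pre-version=([^;]+)")


def pick_version(cookie_header, second):
    """Return (version, needs_cookie) for one request.

    cookie_header is the raw Cookie header (or None), and second is the
    seconds field of the current time, as in %{TIME_SEC}.
    """
    match = COOKIE_RE.search(cookie_header or "")
    if match:
        # Returning visitor: keep serving the version they already got.
        return match.group(1), False
    # New visitor: even seconds get page A, odd seconds get page B.
    version = "a" if second % 2 == 0 else "b"
    return version, True
```

A first-time visitor arriving at second 10 gets version "a" plus a cookie; if they come back at second 37 with that cookie, they still get "a".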

# Results

The results were very good, as it worked perfectly. Combined with Google Analytics, I was able to follow the score of each page. It turns out that testing this particular little piece of content of the page was, as expected, really useless. The final score didn't allow picking a winner. Which also kind of proves that the system worked perfectly.

But it still was an interesting challenge!

 April 03, 2014
During the wee hours of the morning, David Faure posted a new mime applications specification which will allow setting up per-desktop default applications, for example, watching films in GNOME Videos in GNOME, but DragonPlayer in KDE. Up until now, this was implemented differently in at least KDE and GNOME, even to the point that GTK+ applications would use the GNOME default when running on a KDE desktop, and vice-versa.

This is made possible using XDG_CURRENT_DESKTOP as implemented in gdm by Lars. This environment variable will also allow implementing more flexible OnlyShowIn and NotShowIn desktop entry fields (especially for desktops like Unity implemented on top of GNOME, or GNOME Classic implemented on top of GNOME) and desktop-specific GSettings/dconf configurations (again, very useful for GNOME Classic). The environment variable supports applying custom configuration in sequence (first GNOME Classic then GNOME in that example).
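
One way to picture the "in sequence" behaviour is a function that walks the desktops named in XDG_CURRENT_DESKTOP, most specific first, and stops at the first match. The Python sketch below is one plausible reading under stated assumptions: the variable is taken to be a colon-separated list, "GNOME-Classic:GNOME" is used as an example value, and the function names are hypothetical.

```python
def current_desktops(value):
    """Split an XDG_CURRENT_DESKTOP value into an ordered list of names."""
    return [name for name in value.split(":") if name]


def should_show(only_show_in, not_show_in, xdg_current_desktop):
    """Decide whether a desktop entry is shown in the current session.

    Desktops are considered in order, so a more specific desktop listed
    first (e.g. GNOME Classic) wins over the one it is built on (GNOME).
    """
    for desktop in current_desktops(xdg_current_desktop):
        if desktop in only_show_in:
            return True
        if desktop in not_show_in:
            return False
    # Nothing matched: hidden if OnlyShowIn was given, shown otherwise.
    return not only_show_in
```

With this reading, an entry marked OnlyShowIn=GNOME is still shown in a GNOME Classic session, while NotShowIn=GNOME-Classic hides an entry there without affecting plain GNOME.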

Today, Ryan and David discussed the desktop file cache, making it faster to access desktop file data without hitting scattered files. The partial implementation used a custom structure, but, after many kdbus discussions earlier in the week, Ryan came up with a format based on serialised GVariant, the same format as kdbus messages (but implementable without implementing a full GVariant parser).

We also spent quite a bit of time writing out requirements for a filesystem notification to support some of the unloved desktop use cases. Those use cases are currently not supported by either inotify or fanotify.

That will end our face-to-face meeting. Ryan and David led a Lunch'n'Learn in the SUSE offices to engineers excited about better application integration in the desktops irrespective of toolkits.

Many thanks to SUSE for the accommodation as well as hosting the meeting in sunny Nürnberg. Special thanks to Ludwig Nussel for the morning biscuits :)