planet.freedesktop.org
July 30, 2014
or why publishing code is STEP ZERO.

If you've been developing code internally for a kernel contribution, you've probably got a lot of reasons not to default to working in the open from the start: you probably don't work for Red Hat or other companies with default-to-open policies, or perhaps you are scared of the scary kernel community and want to present a polished gem.

If your company is a pain with legal reviews etc., you have probably spent (wasted) months of engineering time on internal reviews and such, so you think all of this matters, because why wouldn't it? You just spent (wasted) a lot of time on it, so it must matter.

So you have your polished codebase; why wouldn't those kernel maintainers love to merge it?

Then you publish the source code.

Oh look, you just left your house. The merging of your code is many, many miles distant, and you just started walking that road. Just now; not when you started writing it, not when you started legal review, not when you rewrote it internally the 4th time. You just did it this moment.

You might have to rewrite it externally 6 times, you might never get it merged, it might be something your competitors are also working on, and the kernel maintainers would rather you cooperated with people your management would lose their minds over. That is the kernel development process.

step zero: publish the code. leave the house.

(Lately I've been seeing this problem more and more, so I decided to write it up. It really isn't directed at anyone in particular; I think a lot of vendors are guilty of this.)
July 25, 2014
Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This post, which will hopefully become a series rather than just a single post, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry, through the wl_display. The registry is the root of the whole interface (class) hierarchy. wl_registry advertises the global objects by numerical name, and using the wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.
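
For reference, the (abridged) wl_registry.bind specification from wayland.xml shows that interface-less new_id argument:

```xml
<request name="bind">
  <arg name="name" type="uint"/>
  <!-- no interface attribute here: the wire protocol therefore carries
       the interface name and version explicitly, in addition to the
       new object ID -->
  <arg name="id" type="new_id"/>
</request>
```

Every other new_id argument in the core protocol names its interface in the XML, so only the object ID needs to go over the wire.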

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.

Although rare, the server may also create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.

Object creation never fails. Once the request or event is sent, the new object it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.

How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.

The other way is to destroy the object by an event. In that case, the interface must not define a destructor request, and the event must be clearly documented as destructive, since there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and it requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.

Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence: the client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into a synchronous one, congratulations!

Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:
  • Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such that all resources on both the server and client side could be freed. And there are still some cases not fixed.
  • Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
  • Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
  • Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.
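
As a sketch of the first two recommendations, a minimal hypothetical extension interface (all names made up) with an explicit destructor request could look like this:

```xml
<interface name="xyz_example" version="1">
  <!-- type="destructor" makes the scanner generate
       xyz_example_destroy(), which sends this request and destroys
       the client-side proxy in one step, so both sides can free
       their resources -->
  <request name="destroy" type="destructor"/>
</interface>
```

Even if the interface needs nothing else today, having the destructor request from version 1 avoids the awkward retrofits described above.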
July 23, 2014
DRI3 has plenty of necessary fixes for X.org and Wayland, but it's still young in its integration. It's been integrated in the upcoming Fedora 21, and recently in Arch as well.

If WebKitGTK+ applications hang or become unusably slow when an HTML5 video is supposed to be playing, you might be hitting this bug.

If Totem crashes on startup, it's likely this problem, reported against cogl for now.

Feel free to add a comment if you see other bugs related to DRI3, or have more information about those.

Update: Wayland is already perfect, and doesn't use DRI3. The "DRI2" structures in Mesa are just that, structures. With Wayland, the DRI2 protocol isn't actually used.
I've just pushed the vc4-sim-validate branch to my Mesa tree. It's the culmination of the last week's worth of pondering and false starts since I got my first texture sampling in simulation last Wednesday.

Handling texturing on vc4 safely is a pain. The pointer to texture contents doesn't appear in the normal command stream, and instead it's in the uniform stream. Which uniform happens to contain the pointer depends on how many uniforms have been loaded by the time you get to the QPU_W_TMU[01]_[STRB] writes. Since there's no IOMMU, I can't trust userspace to tell me where the uniform is, otherwise I'd be allowing them to just lie, put in physical addresses, and read arbitrary system memory.

This meant I had to write a shader parser for the kernel, have it spit out a collection of references to texture samples, switch the uniform data from living in BOs in the user -> kernel ABI to being passed in as normal system memory that gets copied to the temporary exec BO, and then do relocations on that.

Instead of trying to write this in the kernel, with a ~10 minute turnaround time per test run, I copied my kernel code into Mesa with a little bit of wrapper code to give a kernel-like API environment, and did my development on that. When I'm looking at possibly 100s of iterations to get all the validation code working, it was well worth the day spent to build that infrastructure so that I could get my testing turnaround time down to about 15 sec.

I haven't done actual validation to make sure that the texture samples don't access outside of the bounds of the texture yet (though I at least have the infrastructure necessary now), just like I haven't done that validation for so many other pointers (vertex fetch, tile load/stores, etc.). I also need to copy the code back out to the kernel driver, and it really deserves some cleanups to add sanity to the many different addresses involved (unvalidated vaddr, validated vaddr, and validated paddr of the data for each of render, bin, shader recs, uniforms). But hopefully once I do that, I can soon start bringing up glamor on the Pi (though I've got some major issue with tile allocation BO memory management before anything's stable on the Pi).
July 22, 2014

Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).

Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something that should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB for page tables to map one page of system memory. With dynamic page table allocations, this is reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself, and on current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and I believe the result is a huge improvement in code readability, should anything from the series get merged.
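
The 8MB vs. 8K claim is easy to check with back-of-envelope C. This is a standalone sketch of the legacy 32b layout described above (4 PDP entries, 512-entry tables), not driver code, and the helper names are mine:

```c
#define PAGE_SIZE 4096UL

/* Legacy 32b PPGTT: 4 PDP entries, each pointing at a page directory
 * with 512 entries, each of which points at a page table. Allocating
 * everything up front means all 4 PDs and all 2048 PTs exist. */
static unsigned long up_front_bytes(void)
{
	return (4 + 4 * 512) * PAGE_SIZE;	/* 4 PDs + 2048 PTs */
}

/* With dynamic allocations, mapping a single page only needs the one
 * PD and the one PT that cover its address. */
static unsigned long dynamic_bytes(void)
{
	return 2 * PAGE_SIZE;
}
```

up_front_bytes() works out to 8,404,992 bytes (about 8MB), while dynamic_bytes() is 8192 bytes, matching the figures above.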

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature             | Status           | TODO
--------------------|------------------|------------------------------------
Dynamic page tables | Implemented      | Test and fix bugs
64b address space   | Implemented      | Test and fix bugs
GPU mirroring       | Proof of concept | Decide on interface; implement interface.1

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I’ve gone out of my way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. But it’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING IOCTL.
  4. Do one of the aforementioned operations on BOx and BOy
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is which physical pages it touches. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS, for instance, does not need to know anything about where the Scratch Space Base Offset is actually located; it only needs to provide an offset relative to the General State Base Address. The General State Base Address, programmed via STATE_BASE_ADDRESS, does need to be known by userspace.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which need an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin’ it Happen

64b

As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way 64b addressing came about on x86. This Wikipedia article explains it pretty well, and has links.
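
To make the 4-level walk concrete, here is a small standalone sketch of how a 48b address splits into the four 9-bit table indices plus the 12-bit page offset. The shift and mask names are generic illustrations, not the driver's actual macros:

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define INDEX_BITS	9	/* 512 entries per table level */
#define INDEX_MASK	((1 << INDEX_BITS) - 1)

struct walk {
	int pml4e, pdpe, pde, pte;
};

/* Split a 48-bit address into the four table indices, exactly as a
 * 4-level page walk would consume them. */
static struct walk decompose(uint64_t address)
{
	struct walk w;

	w.pte   = (address >> PAGE_SHIFT) & INDEX_MASK;
	w.pde   = (address >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
	w.pdpe  = (address >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK;
	w.pml4e = (address >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK;
	return w;
}
```

Shifting the indices back and OR-ing in the low 12 bits reconstructs the original address, which is a handy sanity check when writing walk code.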

“This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have now is “done”, then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

1 page for the PML4
256 PDP pages per PML4 (the PML4 has 512 entries, but we only use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256GB = oops
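
The arithmetic is easy to verify in a few lines of C; this is a standalone check of the sum above, not driver code:

```c
#define PAGE_SIZE 4096ULL

/* Every PT, PD and PDP page plus the PML4 page, allocated up front,
 * for a 48b address space with 256 PML4 entries in use. */
static unsigned long long total_up_front_bytes(void)
{
	unsigned long long pt_pages  = 256ULL * 512 * 512;
	unsigned long long pd_pages  = 256ULL * 512;
	unsigned long long pdp_pages = 256;
	unsigned long long pml4_page = 1;

	return (pt_pages + pd_pages + pdp_pages + pml4_page) * PAGE_SIZE;
}
```

That comes out to 275,415,830,528 bytes, i.e. a bit over 256GB of page-table memory per address space, which is why up-front allocation is flat out impossible on most systems.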

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand-allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4-level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy it. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
        return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
        return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                               _PAGE_NUMA);
}
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
}
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

X86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don’t know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find out whether a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

  • PML4[0-255], each point to a PDP
  • PDP[0-255][0-511], each point to a PD
  • PD[0-255][0-511][0-511], each point to a PT
  • PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
  2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0 is free.
    2. [i915] Allocate PDP for PML4[0]
    3. [i915] Allocate PD for PDP[0][0]
    4. [i915] Allocate PT for PD[0][0][0]
    5. [i915] (condensed) Set pointers from PML4->PDP->PD->PT
    6. [i915] Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing pages.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0x200000 is free.
    2. [i915] Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Table Tracking with Bitmaps

Okay, I could have used a sentinel for empty entries: point the page table entry to the scratch page. Implementing that involves reading back potentially large amounts of data from the page tables, which would be slow. It should work though. I didn’t try it.

After I had determined I couldn’t reuse the x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point. Notice the LaTeX!
\frac{2^{47}\,bytes}{\frac{4096\,bytes}{1\,page}} = 34359738368\,pages \\  34359738368\,pages \times \frac{1\,bit}{1\,page} = 34359738368\,bits \\  34359738368\,bits \times \frac{1\,byte}{8\,bits} = 4294967296\,bytes
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
  256\,entries + (256\times512)\,entries + (256\times512^2)\,entries = 67240192\,entries \\  67240192\,entries \times \frac{1\,bit}{1\,entry} = 67240192\,bits \\  67240192\,bits \times \frac{1\,byte}{8\,bits} = 8405024\,bytes \\  4294967296\,bytes + 8405024\,bytes = 4303372320\,bytes \\  4303372320\,bytes \times \frac{1\,GB}{1073741824\,bytes} = 4.0078\,GB
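
A quick standalone sanity check of that arithmetic in C (constants written out rather than using the driver's macros):

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* One bit per page over a 47b (128TB) address space. */
static uint64_t page_bitmap_bytes(void)
{
	uint64_t pages = 1ULL << (47 - PAGE_SHIFT);	/* 2^35 pages */

	return pages / 8;				/* 1 bit per page */
}

/* One bit per PDP, PD and PT entry, with 256 PML4 entries in use. */
static uint64_t table_bitmap_bytes(void)
{
	uint64_t entries = 256ULL + 256ULL * 512 + 256ULL * 512 * 512;

	return entries / 8;
}
```

page_bitmap_bytes() is exactly 4GB, and table_bitmap_bytes() adds roughly 8MB on top, giving the ~4.008GB total above.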

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details that I couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is about 0.003% of the memory addressable by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
	...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
	...
}

static struct i915_pagetab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
	...
}

/**
 * alloc_page_tables - Allocate all page tables for the given virtual address.
 *
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 *
 * @ppgtt:	The PPGTT structure encapsulating the virtual address space.
 * @address:	The virtual address for which we want page tables.
 *
 */
static void
alloc_page_tables(struct i915_hw_ppgtt *ppgtt, unsigned long address)
{
	struct i915_pagetab *pt;
	struct i915_pagedir *pd;
	struct i915_pagedirpo *pdp;
	struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

	int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
	int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
	int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
	int pte = (address >> PAGE_SHIFT) & (I915_PDES_PER_PD - 1);

	if (!test_bit(pml4e, pml4->used_pml4es))
		goto alloc_pdp;

	pdp = pml4->pagedirpo[pml4e];
	if (!test_bit(pdpe, pdp->used_pdpes))
		goto alloc_pd;

	pd = pdp->pagedirs[pdpe];
	if (!test_bit(pde, pd->used_pdes))
		goto alloc_pt;

	pt = pd->page_tables[pde];
	set_bit(pte, pt->used_ptes);
	return;

alloc_pdp:
	pdp = alloc_one_pdp(pml4, pml4e);
	set_bit(pml4e, pml4->used_pml4es);
alloc_pd:
	pd = alloc_one_pd(pdp, pdpe);
	set_bit(pdpe, pdp->used_pdpes);
alloc_pt:
	pt = alloc_one_pt(pd, pde);
	set_bit(pde, pd->used_pdes);
	set_bit(pte, pt->used_ptes);
}

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
	__u64 user_ptr;
	__u64 user_size;
	__u32 ctx_id;
	__u32 flags;
#define I915_USERPTR_READ_ONLY          (1<<0)
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
	/**
	 * Returned handle for the object.
	 *
	 * Object handles are nonzero.
	 */
	__u32 handle;
	__u32 pad;
};

The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.
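
To make that concrete, here is a hedged sketch of how a client might fill in the proposed (never merged) struct; prep_mirror() and its alignment check are my invention for illustration, and no real ioctl is issued:

```c
#include <stdint.h>
#include <string.h>

/* The proposed userptr layout from the posted patches. */
struct drm_i915_gem_userptr {
	uint64_t user_ptr;
	uint64_t user_size;
	uint32_t ctx_id;
	uint32_t flags;
#define I915_USERPTR_READ_ONLY          (1 << 0)
#define I915_USERPTR_GPU_MIRROR         (1 << 1)
#define I915_USERPTR_UNSYNCHRONIZED     (1u << 31)
	uint32_t handle;
	uint32_t pad;
};

/* Fill in a mirror request. With the mirror flag set, user_ptr doubles
 * as the GPU virtual address, so it must be page aligned. Returns 0 on
 * success, -1 if the range cannot be mirrored. A real client would then
 * pass the struct to the DRM_IOCTL_I915_GEM_USERPTR ioctl. */
static int prep_mirror(struct drm_i915_gem_userptr *arg,
		       uint64_t addr, uint64_t size, uint32_t ctx_id)
{
	if ((addr | size) & 4095)
		return -1;

	memset(arg, 0, sizeof(*arg));
	arg->user_ptr = addr;
	arg->user_size = size;
	arg->ctx_id = ctx_id;
	arg->flags = I915_USERPTR_GPU_MIRROR;
	return 0;
}
```

One ioctl like this would be needed per object, which is part of the overhead complaint below.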

  • Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us
  • Cons:
    • You need 1 IOCTL per object. Much unneeded overhead.
    • It’s subject to a lot of the problems userptr has5
    • Userptr was already merged, so unless pad gets repurposed, we’re screwed

What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trend of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel not to relocate the object, and to use the presumed_offset field that already exists. This is sometimes called “soft pin.” It is a bit of a chicken and egg problem, since the amount of work in userspace to make this useful is non-trivial, and the feature can’t be merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that relates to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to redeem the voucher.

Image links

The images I’ve created. Feel free to do with them as you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/legacy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/mirrored.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/table_hierarchy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/addr-bitmap.svg


  1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the purpose of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info, which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN" 

  2. The GEM and BO terminology is a fancy-sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have the CPU observe data which the GPU has written (output)  

  3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect. 

  4. I am not sure how KVM manages page tables. At least conceptually I’d think it has a similar problem to the i915 driver’s page table management. I should probably have looked a bit closer, as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took 

  5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring 

July 21, 2014

Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fall backs aren’t terrible, this isn’t as useful as it once was, and because this uses Glamor in a weird way, we’re making the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding “_uxa” to their names. A couple dozen sed runs and now a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new “intel_uxa.h” file was created to hold all of the definitions.

Finally, a few non UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
    int ok = 0;

    if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
        ok = glamor_fill_spans_nf(pDrawable,
                      pGC, n, ppt, pwidth, fSorted);
        uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
    }

    if (!ok)
        goto fallback;

    return;
}

Pulling those out shrank the UXA code by quite a bit.

Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an ‘accel’ variable which could be one of three options:

  • ACCEL_GLAMOR
  • ACCEL_UXA
  • ACCEL_NONE

I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3, so that we can bring up Mesa and run it under X before we have any acceleration code ready; this avoids a dependency loop when doing new hardware bring-up. All that it requires is a kernel that offers mode setting and buffer allocation.

Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we’ve got lots of places that look like:

        switch (intel->accel) {
#if USE_GLAMOR
        case ACCEL_GLAMOR:
                if (!intel_glamor_create_screen_resources(screen))
                        return FALSE;
                break;
#endif
#if USE_UXA
        case ACCEL_UXA:
                if (!intel_uxa_create_screen_resources(screen))
                        return FALSE;
        break;
#endif
        case ACCEL_NONE:
                if (!intel_none_create_screen_resources(screen))
                        return FALSE;
                break;
        }

Using a switch means that we can easily elide code that isn’t wanted in a particular build. Of course ‘accel’ is an enum, so places which are missing one of the required paths will cause a compiler warning.

It’s not all perfectly clean yet; there are piles of UXA-only paths still.

Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef’s in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231

The ‘legacy’ addition supports i810-class hardware, which is needed for a complete driver.

Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling (!) because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven’t analyzed what effect this has on performance.

Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It’s generally the fastest way of updating the screen as you don’t have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that’s pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

fd = glamor_fd_from_pixmap(screen, pixmap, &stride, &size);
bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
close(fd);
intel_glamor_get_pixmap(pixmap)->bo = bo;

That last bit remembers the bo in some local memory so we don’t have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It’s all quite round-about, but it does seem to work just fine.

After I’d gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn’t working. That turned out to have an unexpected consequence — all full-screen applications would run flat-out, and not be limited to frame rate. Present ‘recovers’ from a failed flip queue operation by immediately performing a CopyArea, not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued, forcing them to fall back to CopyArea at the right time.

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

Painting Root before Mode set

The X server has generally done initialization in one order:

  1. Create root pixmap
  2. Set video modes
  3. Paint root window

Recently, we’ve added a ‘-background none’ option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can’t use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

  1. Create root pixmap
  2. Paint root window
  3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we’re not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy — just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge — I needed to tell the DIX level RandR functions about the new modes; the mode setting operation used during server init doesn’t call up into RandR as RandR lists the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.

Performance

I ran the current glamor version of the intel driver with the master branch of the X server and there were not any huge differences since my last Glamor performance evaluation aside from GetImage. The reason is that UXA/Glamor never called Glamor’s image functions, and the UXA GetImage is pretty slow. Using Mesa’s image download turns out to have a huge performance benefit:

1. UXA/Glamor from April
2. Glamor from today

       1                 2                 Operation
------------   -------------------------   -------------------------
     50700.0        56300.0 (     1.110)   ShmGetImage 10x10 square 
     12600.0        26200.0 (     2.079)   ShmGetImage 100x100 square 
      1840.0         4250.0 (     2.310)   ShmGetImage 500x500 square 
      3290.0          202.0 (     0.061)   ShmGetImage XY 10x10 square 
        36.5          170.0 (     4.658)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     49800.0        50200.0 (     1.008)   GetImage 10x10 square 
      5690.0        19300.0 (     3.392)   GetImage 100x100 square 
       609.0         1360.0 (     2.233)   GetImage 500x500 square 
      3100.0          206.0 (     0.066)   GetImage XY 10x10 square 
        36.4          183.0 (     5.027)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Running UXA from today, the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before.

1: UXA today
2: Glamor today

       1                 2                 Operation
------------   -------------------------   -------------------------
     43200.0        56300.0 (     1.303)   ShmGetImage 10x10 square 
      2600.0        26200.0 (    10.077)   ShmGetImage 100x100 square 
       130.0         4250.0 (    32.692)   ShmGetImage 500x500 square 
      3260.0          202.0 (     0.062)   ShmGetImage XY 10x10 square 
        36.7          170.0 (     4.632)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     41700.0        50200.0 (     1.204)   GetImage 10x10 square 
      2520.0        19300.0 (     7.659)   GetImage 100x100 square 
       125.0         1360.0 (    10.880)   GetImage 500x500 square 
      3150.0          206.0 (     0.065)   GetImage XY 10x10 square 
        36.1          183.0 (     5.069)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Of course, this is all just x11perf, which doesn’t represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it’s nice to have this kind of speed up.

Status

I’m running this on my crash box to get some performance numbers and continue testing it. I’ll switch my desktop over when I feel a bit more comfortable with how it’s working. But, I think it’s feature complete at this point.

Where’s the Code

As usual, the code is in my personal repository. It’s on the ‘glamor’ branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor
July 19, 2014

Hello,

As part of my Google Summer of Code project I implemented MP counters (for compute only) on nv50/tesla. This work follows the implementation of MP counters for nvc0/fermi I did the last year.

Compute counters are used by OpenCL while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction between these two types of counters made by NVIDIA is arbitrary and won’t be present in my implementation. That’s why compute counters can also be used to give detailed information of OpenGL applications like the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context, while performance counters programmed through the PCOUNTER engine are global. An MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately, while a global counter reports activities regardless of the context that generates them.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface which only exposes compute counters. On nv50/tesla, CUPTI exposes 13 performance counters like instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure a MP counter we use the command stream like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one just reinitializes the counter. Then, to select the group of the MP counter we have added a software method. To poll counters we use a notifier buffer object which is allocated along with a channel. This notifier allows the kernel and mesa to communicate. This approach has already been explained in my latest article.

To sum up, this prototype adds support for 13 performance counters on nv50/tesla. All of the code is available on my github account. If you are interested, you can take a look at the mesa and the nouveau code.

Have a good day.


July 17, 2014

Two years ago, I got appointed as chairman of the openSUSE Board. I was very excited about this opportunity, especially as it allowed me to keep contributing to openSUSE, after having moved to work on the cloud a few months before. I remember how I wanted to find new ways to participate in the project, and this was just a fantastic match for this. I had been on the GNOME Foundation board for a long time, so I knew it would not be easy and always fun, but I also knew I would pretty much enjoy it. And I did.

Fast-forward to today: I'm still deeply caring about the project and I'm still excited about what we do in the openSUSE board. However, some happy event to come in a couple of months means that I'll have much less time to dedicate to openSUSE (and other projects). Therefore I decided a couple of months ago that I would step down before the end of the summer, after we'd have prepared the plan for the transition. Not an easy decision, but the right one, I feel.

And here we are now, with the official news out: I'm no longer the chairman :-) (See also this thread) Of course I'll still stay around and contribute to openSUSE, no worry about that! But as mentioned above, I'll have less time for that as offline life will be more "busy".

openSUSE Board Chairman at oSC14

Since I mentioned that we were working on a transition... First, knowing the current board, I have no doubt everything will be kept pushed in the right direction. But on top of that, my good friend Richard Brown has been appointed as the new chairman. Richard knows the project pretty well and he has been on the board for some time now, so is aware of everything that's going on. I've been able to watch his passion for the project, and that's why I'm 100% confident that he will rock!

Anandtech recently went all out on the ARM Midgard architecture (Mali T series). This was quite astounding, as ARM MPD tends to be a pretty closed shop. The Anandtech coverage included an in-depth view of the Mali Midgard GPU, a (short) Q&A session with Jem Davies (the head honcho of ARM MPD, ARM’s Media Processing Division, the part of ARM that develops the Mali and the display and video engines) and a Google hangout with Jem Davies a week later.

This set of articles does not seem like the sort of thing that ARM MPD would have initiated itself. Since both Imagination Technologies and NVidia did something similar months earlier, my feeling is that this was either initiated by Anand Lal Shimpi himself, or that this was requested by ARM marketing in response to the other articles.

Several interesting observations can be made from this though, especially from the answers (or sometimes, lack thereof) to the Q&A and google hangout sessions.

Hiding behind Linaro.

First off, Mr Davies still does not see an open source driver as a worthwhile endeavour for ARM MPD, and this is a position that hasn’t changed since I started the lima driver, when my former employer went and talked to ARM management. Rumour has it that most of ARM’s engineers, both in MPD and other departments, would like this to be different, and that Mr Davies is mostly alone with his views, but that’s currently just hearsay. He himself states that there are only business reasons against an open source driver for the Mali.

To give some weight to this, Mr Davies stated that he contributed to the Linux kernel, and I called him out on that one, as I couldn’t find any mention of him in a kernel git tree. It seems however that his contributions are from the BitKeeper days, and that the author trail on those changes probably got lost. But having contributed to a project at one point or another is, to me, not proof that one actively supports the idea of open source software; at best it proves that adding kernel support for a given ARM device or subsystem was simply necessary at one point.

Mr Davies also talked about how ARM is investing a lot in Linaro, as proof of ARM’s support of open source software. Linaro is a consortium to further Linux on ARM, so by definition ARM plays a very big role in it. But it is not ARM MPD that drives Linaro, it is ARM itself. So this is not proof of ARM MPD actively supporting open source software. Mr Davies did not claim differently, but this distinction should be made very clear in this context.

Then, Linaro can be described as an industry consortium. For non-founding members of a consortium, such a construction is often used to park some less useful people while gaining the privilege to claim involvement as and when desired. The difference from other consortia is that most of the members come from a deeply embedded background, where the word “open” was never spoken before, and, magically, simply by having joined Linaro, those deeply embedded companies now feel like they successfully ticked the “open source” box on their marketing checklist. Several of Linaro’s members are still having severe difficulty conforming to the GPL, but they still proudly wear the Linaro badge as proof of their open source…ness?

As a prominent member of the sunxi community, I am most familiar with Allwinner, a small Chinese cheap-SoC designer. At the start of the year, we were seeing some solid signs of Allwinner opening up to our community directly. In March, however, Allwinner joined Linaro and people were hopeful that this meant that a new era of openness had started for Allwinner. As usual, I was the only cynical voice and I warned that this could mean that Allwinner now wouldn’t see the need to further engage with us. Ever since, we haven’t been able to reach our contacts inside Allwinner anymore, and even our requests for compliance with the GPL get ignored.

Linaro membership does not absolve from limited open source involvement or downright license violation, but for many members, this is exactly how it is used. Linaro seems to be a get-out-of-jail-free card for several of its members. Linaro membership does not need to prove anything; Linaro membership even seems to have the opposite effect in several cases.

ARM driving Linaro is simply no proof that ARM MPD supports open source software.

The patent excuse.

I am amazed that people still attempt to use this as an argument against open source graphics drivers.

Usually this is combined with the claim that open source drivers are exposing too much of the inner workings of the hardware. But this logic in itself states that the hardware is the problem, not the software. The hardware itself might or might not have patent issues, and it is just a matter of time before the owner of said breached patents will come a-knocking. At best, an open source driver might speed up the discovery of said issues, but the driver itself never is the cause, as the problems will have been there all along.

One would actually think that the Anandtech article about the midgard architecture would reveal more about the hardware, and trigger more litigation, than the lima driver could ever do, especially given how neatly packaged an in depth anandtech article is. Yet ARM MPD seemed to have had no issue with exposing this much information in their marketing blitz.

I also do not believe that patents are such a big issue. If graphics hardware patents were such big business, you would expect that an industry expert in graphics, especially one who is a dab hand at reverse engineering, would be contacted all the time to help expose potential patent issues. Yet I have never been contacted, and I know of no-one who ever has been.

Similarly, the first bits of lima code were made available 2.5 years ago, with bits trickling out slowly (much to my regret), and there are still several unknowns today. If lima played any role in patent disputes, you would again expect that I would be asked to support those looking to assert their patents. Again, nothing.

GPU Patents are just an excuse, nothing more.

When I was at SuSE, we freed ATI for AMD, and we never did hear that excuse. AMD wanted a solid open source strategy for ATI as ATI was not playing ball after the merger, and the bad publicity was hurting server (CPU) sales. Once the decision was made to go the open source route, patents suddenly were not an issue anymore. We did however have to deal with IP issues (or actually, AMD did - we made very sure we didn't get anything that wasn't supposed to be free), such as HDCP and media decoding, which ATI was not at liberty to make public. Given the very heated war that ATI and Nvidia fought at the time, and the huge amount of revenue in this market, you would think that ATI would be a very likely candidate for patent litigation, yet this never stood in the way of an open source driver.

There is another reason why patents are such a popular excuse. The words “troll” and “legal wrangling” are often sprinkled around as well, so that images of shady deals being made by lawyers in smoky backrooms usually come to mind. Yet we never get to hear the details of patent cases, as even Mr Davies himself states that ARM is not making details available of ongoing cases. I also do not know of any public details on cases that have been closed already (not that I have actively looked - feel free to enlighten me). Patents are a perfect blanket excuse where proof apparently does not seem to be required.

We open source developers are very much aware of the damage that software patents do, and this makes the patent weapon perfect for deployment against those who support open source software. But there is a difference between software patents and the patent cases that ARM potentially has to deal with on the Mali. Yet we seem to have made patents our own kryptonite, and are way too easily lulled into backing off at the first mention of the word patent.

Patents are a poor excuse, as there is no direct relationship between an open source driver and the patent litigation around the hardware.

The Resources discussion.

For a hardware vendor (or IP provider), doing a free software driver is never free. A lot of developer time needs to be invested, and this is an ongoing commitment. So yes, a viable open source driver for the Mali will consume some amount of resources.

Mr Davies states that MPD would have to incur this cost on its own, as MPD seems to be a completely separate unit and that further investment can only come from profit made within this group. In light of that information, I must apologize for ever having treated ARM and ARM MPD as one and the same with respect to this topic. I will from now on make it very clear that it is ARM MPD, and ARM MPD alone, that doesn't want an open source mali driver.

I do believe that Mr Davies’ cost-versus-gain calculations are too direct and do not allow for secondary effects.

I also believe that an ongoing refusal to support an open source strategy for the Mali will reflect badly on the sale of ARM processors and other IP, especially with ARM now pushing into the server market and getting into Intel territory. The actions of ARM MPD do affect ARM itself, and vice versa. Admittedly, not as much as with those vendors that more closely tie their in-house GPU to the rest of the system, but that’s far from an absolute lack of shared dependency and responsibility.

The Mali binary problem.

One person in the Q&A section asked why ARM isn’t doing redistributable drivers like Nvidia does for the Tegra. Mr Davies answered that this was a good idea, and that Linaro was doing something along those lines.

Today, ironically, I am the canonical source for mali-400 binaries. At the sunxi project, we got some binaries from the Cubietech people, built from code they received from Allwinner, and the legal terms they were under did not prevent them from releasing the built binaries to the public. Around them (or at least, using the binaries as a separate git module) I built a small make-based installation system which integrates with ARM’s open source memory manager (UMP) and even included a quick GLES test from the lima tests. I stopped just short of Debian packaging. The sunxi-mali repository, and the wiki tutorial that goes with it, is now used by many other projects (for instance linux-rockchip) as their canonical source for (halfway usable) GPU support.

There are several severe problems with these binaries, which we have either fixed directly, have been working around or just have to live with. Direct fixes include adding missing library dependencies, and hollowing out a destructor function which made X complain. These are binary hacks. The xf86-video-fbturbo driver from Siarhei Siamashka works around the broken DRI2 buffer management, but it has to try to autodetect how to work around the issues, as it is differently broken on the different versions of the X11 binaries we have. Then there is the flaky coverage, as we only have binaries for a handful of kernel APIs, making it impossible to match them against all vendor provided SoC/device kernels. We also only have binaries for fbdev or X11, and sometimes for android, mostly for armhf, but not always... It's just one big mess, only slightly better than having nothing at all.

Much to our surprise, in October of last year, ARM MPD published a howto entry about setting up a working driver for Mali Midgard on the Chromebook. It was a step in the right direction, but involved quite a bit of faff, and Connor Abbott (the brilliant teenager REing the Mali shaders) had to go and pour things into a proper git repository so that it would be more immediately useful. Another bout of insane irony, as this laudable step in the right direction by ARM MPD ultimately left something to be desired.

ARM MPD is not like ATI, Nvidia, or even intel, qualcomm or broadcom. The Mali is built into many very different SoC families, and needs to be integrated with different display engines, 2D engines, media engines and memory/cache subsystems.

Even the distribution of drivers is different. From what I understand, Mali drivers are handled as follows. The Mali licensees get the relevant and/or latest Mali driver source code and access to some support from ARM MPD. The device makers, however, only rarely get their hands on source code themselves and usually have to make do with the binaries provided by the SoC vendor. Similarly, the device maker only rarely gets to deal with ARM MPD directly, and usually needs to deal with some proxy at the SoC vendor. This setup puts the responsibility for SoC integration squarely on the SoC vendor, and is well suited to the current mobile market: one image per device at release, and then almost no updates. But that market is changing with the likes of CyanogenMod, and other markets are opening or are actively being opened by ARM, and those require a completely different mode of operation.

There is a gap in Mali driver support that ARM MPD's model of driver delivery does not cater for today, and ARM MPD knows about this. But MPD is going to be fighting an uphill battle to try to correct this properly.

Binary solutions?

So how can ARM MPD try to tackle this problem?

Would ARM MPD keep the burden of making suitable binaries available solely with SoC vendors or device makers? Not likely, as that is a pretty shaky affair that's actively hurting the Mali ecosystem. SoCs for the mobile market have incredibly short lives, and SoC and device software support is so fragmented that these vendors would be responsible for backporting bugfixes to a very wide array of kernels and SoC versions. On top of that, those vendors would only support a limited subset of windowing systems, possibly even only Android as this is their primary market. Then, they would have to set up the support infrastructure to appropriately deal with user queries and bug reports. Only very few vendors will end up even attempting to do this, and none are doing so today. In the end, any improvement at this end will bring no advantages to the Mali brand or ARM MPD. If this path is kept, we will not move on from the abysmal situation we are in today, and the Mali will continue to be seen as a very fragmented product.

ARM MPD has little other option but to try to tackle this itself, directly, and it should do so more proactively than by hiding behind Linaro. Unfortunately, to make any real headway here, this means providing binaries for every kernel driver interface, and the SoC vendor changes to those interfaces, on top of other bits of SoC specific integration. But this also means dealing with user support directly, and these users will of course spend half their time asking questions which should be aimed at the SoC vendor. How is ARM MPD going to convince SoC vendors to participate here? Or is ARM MPD going to maintain most of the SoC integration work itself? Surely it will not keep the burden only at Linaro, wasting the resources of the rest of ARM and of Linaro partners?

ARM MPD is just in a totally different position than the ATIs and Nvidias of this world. Providing binaries that will satisfy a sufficient part of the need is going to be a huge drain on resources. Sure, MPD is not spending the same amount of resources on optimizing for specific setups and specific games like ATI or Nvidia are doing, but it will instead have to spend them on the different SoCs and devices out there. And that's before we start talking about different windowing infrastructure, beyond surfaceflinger, fbdev or X11. Think Wayland, Mir, even DirectFB, or any other special requirements that people tend to have for their embedded hardware.

At best, ARM MPD itself will manage to support surfaceflinger, fbdev and X11 on just a handful of popular devices. But how will ARM MPD know beforehand which devices are going to be popular? How will ARM MPD keep on making sure that the binaries match the available vendor or device kernel trees? Would MPD take the insane route of maintaining their own kernel repositories with a suitable mali kernel driver for those few chosen devices, and backporting changes from the real vendor trees instead? No way.

Attempting to solve this very MPD specific problem with only binaries, to any degree of success, is going to be a huge drain on MPD resources, and in the end, people will still not be satisfied. The problem will remain.

The only fitting solution is an open source driver. Of course, the Samsungs of this world will not ship their flagship phones with just an open source GPU driver in the next few years. But an open source driver will fundamentally solve the issues people currently have with Mali, the issues which fuel both the demand for fitting distributable binaries and for an open source driver. Only an open source driver can be flexible and cost-effective enough to fill that gap. Only an open source driver can get silicon vendors, device makers, solution creators and users chipping in, satisfying their own, very varied, needs.

Change is coming.

The ARM world is rapidly changing. Hardware review sites, which used to only review PC hardware, are more and more taking notice of what is happening in the mobile space. Companies that are still mostly stuck in embedded thinking are having to more and more act like PC hardware makers. The lack of sufficiently broad driver support is becoming a real issue, and one that cannot be solved easily or cheaply with a quick binary fix, especially for those who sell no silicon of their own.

The Mali marketing show on Anandtech tells us that things are looking up. The market is forcing ARM MPD to be more open, and MPD has to either sink or swim. The next step was demonstrated by yours truly and some other very enterprising individuals, and now both Nvidia and Broadcom are going all the way. It is just a matter of time before ARM MPD has to follow, as they need this more than their more progressive competitors.

To finish off, at the end of the Q&A session, someone asked: "Would free drivers give greater value to the shareholders of ARM?" After a quick braindump, I concluded: "Does ARM's lack of free drivers hurt shareholder value?" But we really should be asking: "To what extent does ARM's lack of free drivers hurt shareholder value?"
July 16, 2014

Today I am very happy to announce the release of AppStream 0.7, the second-largest release (judging by commit number) after 0.6. AppStream 0.7 brings many new features for the specification, adds lots of good stuff to libappstream, introduces a new libappstream-qt library for Qt developers and, as always, fixes some bugs.

Unfortunately we broke the API/ABI of libappstream, so please adjust your code accordingly. Apart from that, any other changes are backwards-compatible. So, here is an overview of what’s new in AppStream 0.7:

Specification changes

Distributors may now specify a new <languages/> tag in their distribution XML, providing information about the languages a component supports and the completion-percentage for the language. This allows software-centers to apply smart filtering on applications to highlight the ones which are available in the user's native language.
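For illustration, such a tag could look roughly like this in the distribution XML (a sketch based on the description above; the locales and percentages are made up):

```xml
<languages>
  <lang percentage="100">en_GB</lang>
  <lang percentage="96">de</lang>
  <lang percentage="54">fr</lang>
</languages>
```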

A new addon component type was added to represent software which is designed to be used together with a specific other application (think of a Firefox addon or GNOME-Shell extension). Software-center applications can group the addons together with their main application to provide an easy way for users to install additional functionality for existing applications.

The <provides/> tag gained a new dbus item-type to expose D-Bus interface names the component provides to the outside world. This means in the future it will be possible to search for components providing a specific D-Bus service:

$ appstream-index what-provides dbus org.freedesktop.PackageKit.desktop system

(if you are using the cli tool)

A <developer_name/> tag was added to the generic component definition to define the name of the component developer in a human-readable form. Possible values are, for example “The KDE Community”, “GNOME Developers” or even the developer’s full name. This value can be (optionally) translated and will be displayed in software-centers.

An <update_contact/> tag was added to the specification, to provide a convenient way for distributors to reach upstream to talk about changes made to their metadata or issues with the latest software update. This tag was already used by some projects before, and has now been added to the official specification.

Timestamps in <release/> tags must now be UNIX epochs, YYYYMMDD is no longer valid (fortunately, everyone is already using UNIX epochs).

Last but not least, the <pkgname/> tag is now allowed multiple times per component. We still recommend creating metapackages according to the contents the upstream metadata describes and placing the file there. However, in some cases defining one component to be in multiple packages is a short way to make metadata available correctly without excessive package-tuning (which can become difficult if a <provides/> tag needs to be satisfied).

As a small side note: the multiarch path in /usr/share/appdata is now deprecated, because we think that we can live without it (by shipping -data packages per library and using smarter AppStream metadata generators which take advantage of the ability to define multiple <pkgname/> tags).

Documentation updates

In general, the documentation of the specification has been reworked to be easier to understand and to include less duplication of information. We now use extensive cross-linking to show you the information you need in order to write metadata for your upstream project, or to implement a metadata generator for your distribution.

Because the specification needs to define the allowed tags completely and contain as much information as possible, it is not very easy to digest for upstream authors who just want some metadata shipped quickly. In order to help them, we now have “Quickstart pages” in the documentation, which are rich in examples and contain the most important subset of information you need to write a good metadata file. These quickstart guides already exist for desktop applications and addons; more will follow in the future.

We also have an explicit section dealing with the question “How do I translate upstream metadata?” now.

More changes to the docs are planned for the next point releases. You can find the full project documentation at Freedesktop.

AppStream GObject library and tools

The libappstream library also received lots of changes. The most important one: We switched from using LGPL-3+ to LGPL-2.1+. People who know me know that I love the v3 license family of GPL licenses – I like it for its tivoization protection, its explicit compatibility with some important other licenses and cosmetic details, like entities not losing their right to use the software forever after a license violation. However, an LGPL-3+ library does not mix well with projects licensed under other open source licenses, mainly GPL-2-only projects. I want libappstream to be used by anyone without forcing the project to change its license. For some reason, using the library from proprietary code is easier than using it from a GPL-2-only open source project. The license change was also a popular request from people wanting to use the library, so I made the switch with 0.7. If you want to know more about the LGPL-3 issues, I recommend reading this blogpost by Nikos (GnuTLS).

On the code-side, libappstream received a large pile of bugfixes and some internal restructuring. This makes the cache builder about 5% faster (depending on your system and the amount of metadata which needs to be processed) and prepares for future changes (e.g. I plan to obsolete PackageKit’s desktop-file-database in the long term).

The library also brings back support for legacy AppData files, which it can now read. However, appstream-validate will not validate these files (and kindly ask you to migrate to the new format).

The appstream-index tool received some changes, making its command-line interface a bit more modern. It is also possible now to place the Xapian cache at arbitrary locations, which is a nice feature for developers.

Additionally, the testsuite got improved and should now work on systems which do not have metadata installed.

Of course, libappstream also implements all features of the new 0.7 specification.

With the 0.7 release, some symbols were removed which have been deprecated for a few releases, most notably as_component_get/set_idname, as_database_find_components_by_str, as_component_get/set_homepage and the “pkgname” property of AsComponent (which is now a string array and called “pkgnames”). API level was bumped to 1.

Appstream-Qt

A Qt library to access AppStream data has been added. So if you want to use AppStream metadata in your Qt application, you can easily do that now without touching any GLib/GObject based code!

Special thanks to Sune Vuorela for his nice rework of the Qt library!

And that’s it with the changes for now! Thanks to everyone who helped make 0.7 ready, be it with feedback, contributions to the documentation, translations or code. You can get the release tarballs at Freedesktop. Have fun!

July 14, 2014

Following Christian's Wayland in Fedora Update post, and after Hans fixed the touchpad acceleration, I've been playing with pointer acceleration in libinput a bit. The main focus was not yet on changing it but rather on figuring out what we actually do and where the room for improvement is. There's a tool in my (rather messy) github wip/ptraccel-work branch to re-generate the graphs below.

This was triggered by a simple plan: I want a configuration interface in libinput that provides a sliding scale from -1 to 1 to adjust a device's virtual speed from slowest to fastest, with 0 being the default for that device. A user should not have to worry about the accel mechanism itself, which may be different for any given device, all they need to know is that the setting -0.5 means "halfway between default and 'holy cow this moves like molasses!'". The utopia is of course that for any given acceleration setting, every device feels equally fast (or slow). In order to do that, I needed the right knobs to tweak.
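As a sketch of that idea: a normalized speed setting in [-1, 1] could be translated into a per-device multiplier before it touches the accel parameters. The function name and the 0.5x/2.0x endpoints are hypothetical, not libinput's actual interface:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical mapping from the proposed [-1.0, 1.0] speed setting
 * to a per-device multiplier. 0.0 keeps the device default; the 0.5
 * and 2.0 endpoints are illustrative, not libinput's real values. */
static double
speed_to_multiplier(double speed)
{
	if (speed < -1.0)
		speed = -1.0;
	else if (speed > 1.0)
		speed = 1.0;

	if (speed < 0.0)
		return 1.0 + speed * 0.5; /* -1.0 -> 0.5x, "molasses" */
	else
		return 1.0 + speed;       /*  1.0 -> 2.0x */
}
```

The point of the indirection is that the user only ever sees the normalized scale; how the multiplier is applied to threshold/acceleration stays a per-device detail.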

The code we currently have in libinput is pretty much 1:1 what's used in the X server. The X server sports a lot more configuration options, but what we have in libinput 0.4.0 is essentially what the default acceleration settings are in X. Armed with the knowledge that any #define is a potential knob for configuration I went to investigate. There are two defines that are labelled as adjustable parameters:

  • DEFAULT_THRESHOLD, set to 0.4
  • DEFAULT_ACCELERATION, set to 2.0
But what do they mean, exactly? And what exactly does a value of 0.4 represent?
[side-note: threshold was 4 until I took the constant multiplier out, it's now 0.4 upstream and all the graphs represent that.]

Pointer acceleration is nothing more than mapping some input data to some potentially faster output data. How much faster depends on how fast the device moves, and to get there one usually needs a couple of steps. The trick of course is to make it predictable, so that despite the acceleration, your brain thinks that the visible cursor is an extension of your hand at all speeds.

Let's look at a high-level outline of our pointer acceleration code:

  • calculate the velocity of the current movement
  • use that velocity to calculate the acceleration factor
  • apply accel to dx/dy
  • smooth out the dx/dy to avoid abrupt changes between two events

Calculating pointer speed

We don't just use dx/dy as values, rather, we use the pointer velocity. There's a simple reason for that: dx/dy depends on the device's poll rate (or interrupt frequency). A device that polls twice as often sends half the dx/dy values in each event for the same physical speed.

Calculating the velocity is easy: divide dx/dy by the delta time. We use a set of "trackers" that store previous dx/dy values with their timestamp. As long as we get movement in the same cardinal direction, we take those into account. So if we have 5 events in direction NE, the speed is averaged over those 5 events, smoothing out abrupt speed changes.

The acceleration function

The speed we just calculated is passed to the acceleration function to calculate an acceleration factor.

Figure 1: Mapping of velocity in unit/ms to acceleration factor (unitless). X axes here are labelled in units/ms and mm/s.
This function is the only place where DEFAULT_THRESHOLD/DEFAULT_ACCELERATION are used, but they mostly just stretch the graph. The shape stays the same.

The output of this function is a unit-less acceleration factor that is applied to dx/dy. A factor of 1 means leaving dx/dy untouched, 0.5 is half-speed, 2 is double-speed.

Let's look at the graph for the accel factor output (red): for very slow speeds we have an acceleration factor < 1.0, i.e. we're slowing things down. There is a distinct plateau up to the threshold of 0.4, after that it shoots up to roughly a factor of 1.6 where it flattens out a bit until we hit the max acceleration factor.

Now we can also put units to the two defaults: Threshold is clearly in units/ms, and the acceleration factor is simply a maximum. Whether those are mentally easy to map is a different question.
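The rough shape of the curve in Figure 1 can be sketched as a piecewise function. The plateau value and ramp slope below are assumptions read off the graph, not the actual upstream formula:

```c
#include <assert.h>
#include <math.h>

#define DEFAULT_THRESHOLD 0.4		/* units/ms */
#define DEFAULT_ACCELERATION 2.0	/* unitless maximum factor */

/* Approximate shape only: a deceleration plateau below the
 * threshold, then a linear ramp that is capped at the maximum
 * acceleration factor. */
static double
accel_profile(double velocity) /* in units/ms */
{
	const double plateau = 0.5;	/* assumed slow-speed factor */
	const double slope = 2.0;	/* assumed ramp steepness */
	double factor;

	if (velocity <= DEFAULT_THRESHOLD)
		return plateau;

	factor = plateau + (velocity - DEFAULT_THRESHOLD) * slope;
	return factor < DEFAULT_ACCELERATION ?
	       factor : DEFAULT_ACCELERATION;
}
```

Stretching the graph by changing the two defines corresponds to moving the kink (threshold) and the cap (acceleration) of this function; the shape stays the same.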

We don't use the output of the function as-is, rather we smooth it out using Simpson's rule. The second (green) curve shows the accel factor after the smoothing took effect. This is a contrived example, the tool to generate this data simply increased the velocity, hence this particular line. For more random data, see Figure 2.

Figure 2: Mapping of velocity in unit/ms to acceleration factor (unitless) for a random data set. X axes here are labelled in units/ms and mm/s.
For the data set, I recorded the velocity from libinput while using Firefox a bit.

The smoothing takes history into account, so the data points we get depend on the usage. In this data set (and others I tested) we see that the majority of the points still lie on or close to the pure function, apparently the delta doesn't matter that much. Nonetheless, there are a few points that suggest that the smoothing does take effect in some cases.

It's important to note that this is already the second smoothing to take effect - remember that the velocity (may) average over multiple events and thus smoothens the input data. However, the two smoothing effects somewhat complement each other: velocity smoothing only happens when the pointer moves consistently without much change, the Simpson's smoothing effect is most pronounced when the pointer moves erratically.
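The Simpson's-rule smoothing can be sketched like this: instead of sampling the acceleration factor at a single velocity, average it over a small interval around that velocity. The profile used here and the radius d are assumptions, not the exact upstream code:

```c
#include <assert.h>
#include <math.h>

/* Example profile for demonstration: a constant factor. */
static double
flat_profile(double v)
{
	(void)v;
	return 1.5;
}

/* Average acceleration factor over [v - d, v + d] via Simpson's
 * rule: the integral over [a, b] is approximately
 * (b - a)/6 * (f(a) + 4 f(m) + f(b)); dividing by (b - a) yields
 * the average value of f over the interval. */
static double
smoothed_factor(double v, double d, double (*profile)(double))
{
	double a = v - d > 0.0 ? v - d : 0.0;
	double b = v + d;
	double m = (a + b) / 2.0;

	return (profile(a) + 4.0 * profile(m) + profile(b)) / 6.0;
}
```

For a flat profile the average equals the constant; the smoothing only changes the output where the profile bends, which matches the observation that most points stay on the pure function.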

Ok, now we have the basic function, let's look at the effect.

Pointer speed mappings

Figure 3: Mapping raw unaccelerated dx to accelerated dx, in mm/s assuming a constant physical device resolution of 400 dpi that sends events at 125Hz. dx range mapped is 0..127
The graph was produced by sending 30 events with the same constant speed, then dividing by the number of events to reduce any effects tracker feeding has at the initial couple of events.

The two lines show the actual output speed in mm/s and the gain in mm/s, i.e. (output speed - input speed). We can see the little nook where the threshold kicks in, and that after it the acceleration is linear. Look at Figure 1 again: the linear acceleration is caused by the acceleration factor maxing out quickly.

Most of this graph is theoretical only though. On your average mouse you don't usually get a delta greater than 10 or 15 and this graph covers the theoretical range to 127. So you'd only ever be seeing the effect of up to ~120 mm/s. So a more realistic view of the graph is:

Figure 4: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event).
Same data as Figure 3, but zoomed to the realistic range. We go from a linear speed increase (no acceleration) to a quick bump once the threshold is hit and from then on to a linear speed increase once the maximum acceleration is hit.

And to verify, the ratio of output speed : input speed:

Figure 5: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated.

Looks pretty much exactly like the pure acceleration function, which is to be expected. What's important here though is that this is the effective speed, not some mathematical abstraction. And it shows one limitation: we go from 0 to full acceleration within a really small window.

Again, this is the full theoretical range, the more realistic range is:

Figure 6: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated. Zoomed in to a max of 120 mm/s (15 dx/event).
Same data as Figure 5, just zoomed in to a maximum of 120 mm/s. If we assume that 15 dx/event is roughly the maximum you can reach with a mouse you'll see that we've reached maximum acceleration at a third of the maximum speed and the window where we have adaptive acceleration is tiny.

Tweaking threshold/accel doesn't do that much. Below are the two graphs representing the default (threshold=0.4, accel=2), a doubled threshold (threshold=0.8, accel=2) and a doubled acceleration (threshold=0.4, accel=4).

Figure 6: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.
Figure 7: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, see Figure 5 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.
Doubling either setting just moves the adaptive window around, it doesn't change that much in the grand scheme of things.

Now, of course these were all fairly simple examples with constant speed, etc. Let's look at a diagram of what is essentially random movement, me clicking around in Firefox for a bit:

Figure 8: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set.
And the zoomed-in version of this:
Figure 9: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set, zoomed in to events 450-550 of that set.
This is more-or-less random movement reflecting some real-world usage. What I find interesting is that it's very hard to see any areas where smoothing takes visible effect. The accelerated curve largely looks like a stretched input curve. To be honest, I'm not sure what I should've expected here and how to read that; pointer acceleration data in real-world usage is notoriously hard to visualize.

Summary

So in summary: I think there is room for improvement. We have no acceleration up to the threshold, then we accelerate within too small a window. Acceleration stops adjusting to the speed soon. This makes us lose precision and small speed changes are punished quickly.

Increasing the threshold or the acceleration factor doesn't do that much. Any increase in acceleration makes the mouse faster but the adaptive window stays small. Any increase in threshold makes the acceleration kick in later, but the adaptive window stays small.

We've already merged a number of fixes into libinput, but some more work is needed. I think that to get a good pointer acceleration we need to get a larger adaptive window [Citation needed]. We're currently working on that (and figuring out how to evaluate whatever changes we come up with).

A word on units

The biggest issue I was struggling with when trying to understand the code was that of units. The code didn't document used units anywhere but it turns out that everything was either in device units ("mickeys"), device units/ms or (in the case of the acceleration factors) was unitless.

Device units are unfortunately a pretty useless base entity, only slightly more precise than using the length of a piece of string. A device unit depends on the device resolution and of course that differs between devices. An average USB mouse tends to have 400 dpi (15.75 units/mm) but it's common to have 800 dpi, 1000 dpi and gaming mice go up to 8200dpi. A touchpad can have resolutions of 1092 dpi (43 u/mm), 3277 dpi (129 u/mm), etc. and may even have different resolutions for x and y.

This explains why until commit e874d09b4 the touchpad felt slower than a "normal" mouse. We scaled to a magic constant of 10 units/mm, before hitting the pointer acceleration code. Now, as said above the mouse would likely have a resolution of 15.75 units/mm, making it roughly 50% faster. The acceleration would kick in earlier on the mouse, giving the touchpad and the mouse not only different speeds but a different feel altogether.

Unfortunately, there is not much we can do about mice feeling different depending on the resolution. To my knowledge there is no way to query the resolution on a device. But for absolute devices that need pointer acceleration (i.e. touchpads) we can normalize to a fake resolution of 400 dpi and base the acceleration code on that. This provides the same feel on the mouse and the touchpad, as much as that is possible anyway.
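A sketch of that normalization step (the function name and shape are illustrative; the constant matches the 400 dpi ≈ 15.75 units/mm figure above):

```c
#include <assert.h>
#include <math.h>

#define NORMALIZED_DPI 400.0

/* Scale deltas from a touchpad's native resolution (in units/mm)
 * to the normalized 400 dpi (~15.75 units/mm) before they enter
 * the acceleration code, as described above. */
static void
normalize_delta(double units_per_mm, double *dx, double *dy)
{
	const double target = NORMALIZED_DPI / 25.4; /* units/mm */

	*dx *= target / units_per_mm;
	*dy *= target / units_per_mm;
}
```

With this, a 1092 dpi touchpad (43 units/mm) and a 400 dpi mouse feed the acceleration function deltas on the same scale, so the threshold kicks in at the same physical speed on both.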

July 13, 2014
  • EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long lost the SVG, and it got kind of messed up, but it’s there at the bottom.
  • EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
  • EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
  • EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that supports page tables and address spaces. It’s called “true” because the Aliasing PPGTT was introduced first and was often simply called “PPGTT.”

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn’t realistically be enabled in isolation of the existing driver. When regressions occur it’s likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and to help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2. I highly recommend that the stability patches be used unless you’re reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts, where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how the hardware works. There won’t be PRM references, or hardware state machines. All of those mechanics have been described in part 1 and part 2.

A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two, yellow, orange and blue in the last) that denote the scope of a GPU context/process with the specified feature. Incrementally the definition of a process begins to bleed between the CPU, and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

File Descriptors

Initially all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering will get a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate “who” was doing “what.” This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement, it’s just an idr in the kernel) there exists no mechanism to reference buffer handles from a different file descriptor. For applications which do not require saved context, and which are neither buggy nor malicious, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic since each has a different file descriptor1. Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation.  Before hardware contexts.
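The per-file-descriptor handle scheme can be shown with a toy sketch (simplified stand-in structures, not the driver's actual idr code): each client owns its own handle table, so equal handle numbers name different buffers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 16

struct bo {
	size_t size;
};

/* One handle namespace per file descriptor: handle #1 for one
 * client refers to a different BO than handle #1 for another, and
 * there is no way to look up a handle across clients. */
struct gpu_client {
	struct bo *handles[MAX_HANDLES]; /* handle -> BO, 0 unused */
	int next_handle;
};

static int
client_create_handle(struct gpu_client *c, struct bo *bo)
{
	int handle = ++c->next_handle;

	if (handle >= MAX_HANDLES)
		return -1;
	c->handles[handle] = bo;
	return handle;
}

static struct bo *
client_lookup(struct gpu_client *c, int handle)
{
	if (handle <= 0 || handle >= MAX_HANDLES)
		return NULL;
	return c->handles[handle];
}
```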

Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided  by the original file descriptor mechanism. The hardware context was an opt-in interface which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there’s not really any point in lamenting about how it could be better, now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains, extremely simple.

struct drm_i915_gem_context_create {
	/* output: id of new context*/
	__u32 ctx_id;
	__u32 pad;
};

struct drm_i915_gem_context_destroy {
	__u32 ctx_id;
	__u32 pad;
};

As you can see from the two IOCTL payloads above, I wasn’t lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn’t a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory-conscious GPU clients to keep things tidy. Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, this takes one of those off the list:

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

The block diagram is quite similar to the above diagram, with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

Hardware context isolation

Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolation

If I wrote about this picture, there would be no point in continuing with an organized blog post :-). So I’ll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients:

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

Since the GGTT isn’t really mentioned much in this post, I’d like to point out  that the GTT still exists as you can see in this diagram. It is required for several components that were listed in my previous blog post.

VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as VMA4. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, continuous range within an address space. Conceptually there isn’t much difference between a GEM BO and a VMA. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, Address space). In the likely case that I made no sense just there, a VMA is just another handle on a chunk of GPU memory used for rendering.

Sharing VMAs

You can’t (see the note at the bottom). There’s not a whole lot I can say without doing another post about DMA-Buf, and/or Flink. Perhaps someday I will, but for now I’ll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space and a BO. It remains possible to share a BO. An address space exists for an individual GPU client’s process. Therefore it makes no sense to share a VMA, since the address space cannot be shared5. As a result, when the existing sharing interfaces are used, a shared BO will get multiple VMAs, one per address space that maps it. Trying to go back to the math jargon again:

  1. VMA: (BO, Address Space) // Some BO mapped by the address space.
  2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
  3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.
VMA : PPGTT :: BO : GGTT


In case it’s still unclear, I’ll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMA_global. Jumping ahead, a GPU client cannot have a global mapping; therefore, to render to the frontbuffer it too has a VMA, VMA_pp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can’t think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.
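The (BO, address space) tuple above lends itself to a simple lookup: walk the VMAs attached to a BO until one matches the address space. Here is a toy userspace sketch of that idea; the structure names and layout are hypothetical simplifications, not the real driver's (which uses kernel list heads and different struct names).

```c
#include <assert.h>
#include <stddef.h>

/* Toy structures mirroring the idea, not the real driver layout. */
struct address_space { int id; };
struct bo;

struct vma {
        struct bo *obj;              /* the BO half of the tuple */
        struct address_space *vm;    /* the address-space half */
        struct vma *next;            /* next VMA on the same BO */
};

struct bo {
        struct vma *vma_list;        /* all VMAs referencing this BO */
};

/* A VMA is uniquely identified by the (BO, address space) tuple, so a
 * lookup walks the BO's VMA list comparing the address space pointer. */
static struct vma *bo_to_vma(struct bo *obj, struct address_space *vm)
{
        struct vma *v;
        for (v = obj->vma_list; v; v = v->next)
                if (v->vm == vm)
                        return v;
        return NULL;                 /* BO not mapped into this space */
}
```

A shared BO simply ends up with more than one entry on its list, one per address space that maps it, which is exactly the VMA/VMA″ case in the tuples above.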

Data Structures

Here are the relevant data structures cropped for the sake of brevity.

struct i915_address_space {
        struct drm_mm mm;
        unsigned long start;            /* Start offset always 0 for dri2 */
        size_t total;                   /* size addr space maps (ex. 2GB for ggtt) */
        struct list_head active_list;
        struct list_head inactive_list;
};

struct i915_hw_ppgtt {
        struct i915_address_space base;
        int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
                         struct intel_engine_cs *ring,
                         bool synchronous);
};

struct i915_vma {
        struct drm_mm_node node;
        struct drm_i915_gem_object *obj;
        struct i915_address_space *vm;
};

The struct i915_hw_ppgtt is a subclass of a struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+, but I’ve not opted to do this. I feel there is too much duplication for not enough benefit.
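The "subclass" relationship here is the usual kernel C idiom: embed the base struct as a member and recover the containing struct with container_of(). A minimal userspace sketch (field names other than base are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified "base class", standing in for struct i915_address_space. */
struct address_space { size_t total; };

/* The "subclass" embeds the base, mimicking struct i915_hw_ppgtt. */
struct hw_ppgtt {
        struct address_space base;   /* embedded base "class" */
        int pd_offset;               /* hypothetical subclass-only field */
};

/* Recover the containing struct from a pointer to its member, as the
 * kernel's container_of() does. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static struct hw_ppgtt *to_ppgtt(struct address_space *vm)
{
        return container_of(vm, struct hw_ppgtt, base);
}
```

Code that only cares about address spaces can pass struct address_space pointers around, and PPGTT-specific code downcasts when it knows the space is per-process.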

I’ve already explained in different words that a range of used address space is the VMA. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node, because this is the used part of the address space6. In the i915_vma struct above is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that defines the VMA.

HOLE 0x0->0x64000
VMA 1 0x64000->0x69000
HOLE 0x69000->512M
VMA 2 512M->512.004M
HOLE ~512M->2GB
Allocated space: 0x6000 Free space: 0x7fffa000
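The accounting in that layout is easy to check with a toy version of the allocator bookkeeping. This is not drm_mm (which keeps nodes in lists and handles alignment, eviction, etc.); it is just enough, under those simplifying assumptions, to reproduce the allocated/free totals above for a 2GB space:

```c
#include <assert.h>
#include <stdint.h>

/* Toy fixed-size address-space tracking; the real code uses drm_mm and
 * drm_mm_node from the DRM midlayer. */
#define MAX_NODES 8

struct toy_node { uint64_t start, size; };

struct toy_space {
        uint64_t total;                  /* total size of the space */
        struct toy_node nodes[MAX_NODES];
        int count;
};

/* Record a reservation (a "VMA" in the layout above). */
static void toy_insert(struct toy_space *s, uint64_t start, uint64_t size)
{
        s->nodes[s->count].start = start;
        s->nodes[s->count].size = size;
        s->count++;
}

static uint64_t toy_allocated(const struct toy_space *s)
{
        uint64_t sum = 0;
        for (int i = 0; i < s->count; i++)
                sum += s->nodes[i].size;
        return sum;                      /* everything else is HOLEs */
}
```

Inserting VMA 1 (0x64000, size 0x5000) and VMA 2 (512M, size 0x1000) into a 2GB space gives allocated = 0x6000 and free = 0x7fffa000, matching the picture.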

Relation to the Hardware Context

struct intel_context {
	struct kref ref;
	int id;
	...
	struct i915_address_space *vm;
};

With the 3 elements discussed a few times already (file descriptor, context, and PPGTT), we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this was not done, then innocent GPU clients could feel the wrath. Since the file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), it made sense to hook off of that to create the contexts and PPGTTs.

Implicit Context (“private default context”)

From here on out we can consider a “context” as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today, if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I’ve called this the Private Default Context as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context; in practice it simply retains its state from the previous render operation issued through the IOCTLs.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I’ve submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, the above section will be false and we can just say: every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context savvy).

struct drm_i915_gem_execbuffer2 {
	/**
	 * List of gem_exec_object2 structs
	 */
	__u64 buffers_ptr;
	__u32 buffer_count;

	/** Offset in the batchbuffer to start execution from. */
	__u32 batch_start_offset;
	/** Bytes used in batchbuffer from batch_start_offset */
	__u32 batch_len;
	...
	__u64 flags;
	__u64 rsvd1; /* now used for context info */
	__u64 rsvd2;
};

A process may wish to create several GL contexts. The API allows this, and for reasons I don’t understand, it’s something some applications wish to do. If there were no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context, or else it would not reap the benefits of everything we’ve discussed for a GL context.
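The comment on rsvd1 above ("now used for context info") is the hook for multi-context: the context ID chosen at creation time is packed into that field when submitting. The sketch below mirrors that packing with a local stand-in struct (only the two relevant fields, and a locally defined mask), rather than the real uapi header; treat the names as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal local mirror of the execbuffer2 fields we care about. */
struct exec2 {
        uint64_t flags;
        uint64_t rsvd1;              /* repurposed to carry the context ID */
};

/* Assumption: the context ID occupies the low 32 bits of rsvd1, which
 * select the hardware context (and thus the PPGTT) to run on. */
#define CONTEXT_ID_MASK 0xffffffffULL

static void exec2_set_context(struct exec2 *eb, uint32_t ctx_id)
{
        eb->rsvd1 = (eb->rsvd1 & ~CONTEXT_ID_MASK) | ctx_id;
}

static uint32_t exec2_get_context(const struct exec2 *eb)
{
        return (uint32_t)(eb->rsvd1 & CONTEXT_ID_MASK);
}
```

A client that created two contexts just calls exec2_set_context() with the appropriate ID before each submission; a context ID of 0 would fall back to the implicit default context described earlier.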

The Big Picture – literally

Overview

Context:PPGTT

One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context.

Quoting myself from one of my earlier public declarations, here:

My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed.

My idea was to embed the PPGTT within the context structure, with creating a context always resulting in a new PPGTT. Creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I’m still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that a shared address space is a requirement. Fundamentally this would allow a client to create multiple GPU contexts that share an address space (it resembles what you’d get back when there were only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally, however, my proposal matches what is in use currently. I think it’s a bit easier to think of things this way too.


Current Mesa
Current DDX
2 hypothetical scenarios

Conclusion

Please feel free to send me issues or questions.
Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State Machine

TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward, test it, report bugs, try to fix stuff, don’t yell at me when things break :-).

Summary

That’s most of it. I’d like to give the 10 second summary.

  1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
  2. The GPU has a virtual address space per DRI file descriptor.
  3. There is a connection between the PPGTT, and a Hardware Context.
  4. VMAs are backed by BOs which are backed by physical pages.
  5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing                 | Modern X86 CPU | Modern i915 GPU
----------------------|----------------|-----------------------
Phys Address Limit    | 48b?           | ~40b
Process Isolation     | Yes            | Yes (with True PPGTT)
Virtual Address Space | Yes            | Yes
64b VA Space          | Yes            | GEN8+ 48b only
PTE access controls   | Yes            | No
Page Fault Handling   | Yes            | No
Preemption7           | Yes            | *With execlists

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU – it still has a ways to go. I would be surprised if things didn’t continue going in this direction.

SVG Links

As usual, please feel free to do something useful with the images I’ve created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg

Download PDF

  1. It’s technically possible to make them be the same BO through the two buffer sharing mechanisms. 

  2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting, however it does not contribute to any part of the GPU “process” 

  3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted-in. This means if one GPU client does opt-in, and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow. 

  4. I would have preferred the reservation of a space within the address space be called a, “GVMA”, but that was shot down during review 

  5. There’s a whole section below describing how this statement could be false. For now, let’s pretend address spaces can’t be shared 

  6. For those unfamiliar with the Direct Rendering Manager memory manager, a drm_mm is the structure for the memory manager provided by the DRM midlayer. It does all the things you’d expect out of a memory manager, like find free nodes, allocate nodes, free up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees. 

  7. I am defining the word preemption as the ability to switch at an arbitrary point in time between contexts. On the CPU this is easily accomplished. The GPU running the i915 driver as of today has no way to do this. Once a batch is running it cannot be interrupted except for RC6. 

July 12, 2014

EDIT1 (2014-07-12): Apologies to planets for update.

  • Change b->B (bits to bytes) in the state walkthrough (thanks to Bernard Kilarski)
  • Convert SVG images to PNG because they weren’t being rendered properly.
  • Added TOC
  • Use new style footnotes
  • NOTE: With command parser merged, and execlists on the way – this post is already somewhat outdated.

Disclaimer: Everything documented below is included in the Intel public documentation. Anything I say which offends you is my own words and not those of Intel. Sadly, anything I say that is of monetary value belongs to Intel.

Intro

Goal

My goal is to lay down a basic understanding of how GEN GPU execution works using gem_exec_nop from the intel-gpu-tools suite as an example. One who puts in the time to read this should understand how command submission works for the i915 driver, and how gem_exec_nop tests command submission. You should also have a decent idea of how the hardware handles execution. I intentionally skip topics like relocations, and how graphics virtual addresses are maintained. They are not directly related towards execution, and would make the blog entry too long.

Ideally, I am hoping this will enable people who are interested to file better bugs, improve our tests, or write their own tests.

Terminology

  • i915: The name of the Linux kernel driver for Intel GEN graphics. i915 is the name of an ancient chipset that was one of the first supported by the driver. The driver itself supports chipsets both before, and after i915.
  • BO: Buffer Object. GEM uses handles to identify the buffers used as graphics operands in order to avoid costly copies from userspace to kernel space. BO is the thing which is encapsulated by that handle.
  • GEM: Graphics Execution Manager. The name of a design and API to give userspace GPU clients the ability to execute work on a GPU (the API is technically not specific to GEN).
  • GEN: The name of the Graphics IP developed by Intel Corporation.
  • GPU client: A userspace application or library that submits GPU work.
  • Graphics [virtual] Address: Address space used by the GPU for mapping system memory. GEN is a UMA architecture with regard to the CPU.
  • NOP/NOOP: An assembly instruction mnemonic for a machine opcode that does no work. Note that this is not the same as a lack of work. The instruction is indeed executed, it simply has no side-effects. The execution latency is strictly greater than zero.
  • relocations: The way in which GEM manages to make GPU clients agnostic to where the buffers are actually mapped by the GPU. Out of scope for this blog entry.

Source Code

The source code in this post is found primarily in two places. Note that the links below are both from very fast moving code bases.

The test case: http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tests/gem_exec_nop.c

The driver internals: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_execbuffer.c

GEN Hardware

Before going over gem_exec_nop, I’d like to give an overview of modern GEN hardware:

Coarse GEN block diagram.

I don’t want to say this is the exhaustive list, and indeed, each block above has many sub-components. In the part of the driver I work on this is a pretty logical way to split it. Each of the blocks share very little. The common denominator is a Graphics Virtual Address which is understood by all blocks. This provides an easy communication for work needing to be sent between components. As an example, the command streamer might want the display engine to flip to a new surface. It does so by sending a special message to the display engine along with the address of the surface to flip to. The display engine may respond “out of band” via interrupts (flip completion). There are also built in synchronization primitives that allow the command streamer to wait on events sent by the display engine (we’ll get to the command streamer in more detail later).

Excluding audio, since I know nothing about audio… by a very rough estimate, 85% of the Linux i915.ko code falls into “Other.” Of the remaining 15% in the graphics processing engine, the kernel driver tends to utilize very little of the Fixed Func/EU block above. The total line count outside of the kernel driver for the EU block is enormous, given that the X 2d driver (DDX), mesa, libva, and beignet all have tons of lines of code just for utilizing that part of the hardware.

gem_exec_nop

gem_exec_nop is one of my favorite tests. For me, it’s the first test I run to determine whether or not to even bother with the rest of the test suite.

  • It’s dead simple.
  • It’s fast.
  • It tests a surprisingly large amount of the hardware, and software.
  • Gives some indication of performance
  • It’s deader than dead simple

It’s not a perfect test, some of the things which are missing:

  • Handling under memory pressure (relocs, swaps, etc.)
  • Tiling formats
  • Explicit testing of cacheability types, and coherency (LLC et al.)
  • several GEM interfaces
  • The aforementioned 85% of the driver
  • It doesn’t even execute a NOP instruction!!!

gem_exec_nop flowchart

NOTE: I will explain more about what a batchbuffer is later.

execbuf_5_steps

* (step 1) The docs say we must always follow MI_BATCH_BUFFER_END with an MI_NOOP. The presumed reason for this is that the hardware may prefetch the next instruction, and I think the designers wanted to dumb down the fact that they can't handle a pagefault on the prefetch, so they simply demand a MI_NOOP.
** (step1) MI_NOOP is defined as a dword of value 0x00000000. GEM BOs are 0 by default. So we have an implicit buffer full of MI_NOOP.
  1. Creating a batchbuffer is done using GEM APIs. Here we create a batchbuffer of size 4096, and fill in two instructions. The batchbuffer is the basic unit of execution. The only pertinent point to keep in mind is this is the only buffer being created for this test. Note that this step, or a similar one, is done in almost every test.
  2. Here we set up the data structure that will be passed to the kernel in an IOCTL. There’s a pointer to the list of buffers, in our case just the one batchbuffer created in step one. The size of 8 (each of the two instructions is 4 bytes), and some flags which we’ll skip for now, are also included in the struct.
  3. The dotted line through step 3 denotes the userspace/kernel barrier. Above the line is gem_exec_nop.c, below is i915_gem_execbuffer.c. DRM, which is a common subsystem interface, actually dispatches the IOCTLs to the i915 driver.
  4. The kernel handles the received data. Talked about in more detail later.
  5. Submit to the GPU for execution. Also, detailed later.
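Step 1 above can be sketched in a few lines of C. The opcode value below follows the convention that MI_BATCH_BUFFER_END is opcode 0x0A placed in bits 28:23 of the dword (and MI_NOOP is an all-zero dword, per the footnote); the function name is mine, not gem_exec_nop's, and the real test of course writes this into a GEM BO rather than a plain array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* MI_NOOP is an all-zero dword; MI_BATCH_BUFFER_END is assumed here to
 * be opcode 0x0A in bits 28:23 of the instruction dword. */
#define MI_NOOP                 0x00000000u
#define MI_BATCH_BUFFER_END     (0x0Au << 23)

/* Fill the first two dwords of the batch, as step 1 describes. The
 * rest of the buffer stays zero, i.e. an implicit run of MI_NOOPs. */
static void fill_nop_batch(uint32_t *batch, size_t dwords)
{
        memset(batch, 0, dwords * sizeof(*batch));
        batch[0] = MI_BATCH_BUFFER_END;
        batch[1] = MI_NOOP;          /* docs require a NOOP after BB_END */
}
```

With a 4096-byte batch that is 1024 dwords, of which only the first two are meaningful; the batch_len of 8 in step 2 covers exactly those two instructions.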

Execbuf2 IOCTL and Hardware Execution

 i915.ko execbuffer2 handling (step 4 and 5 in the picture above)

The eventual goal of the kernel driver is to take the batchbuffer passed in from userspace, make sure it is visible to the GPU by mapping it, and then submit it to the GPU for execution. The aforementioned operations are synchronous with respect to the IOCTL1. In other words, by the time the execution returns to the application, the GPU knows about the work. The work is completed asynchronously.

I’ll detail some of the steps a bit. Unfortunately, I do not have pretty pictures for this one. You can follow along in i915_gem_execbuffer.c; i915_gem_do_execbuffer()

  1. copy_from_user – Copy the BOs in from userspace. Remember that the BO is a handle and not actual memory being copied; this allows a relatively small and fast copy to take place. In gem_exec_nop, there is exactly 1 BO: the batchbuffer.
  2. some sanity checks – not interesting
  3. look up – Do a lookup of all the handles for the BOs passed in via the buffers_ptr member (copied in during #1). Make sure the buffers still exist and so on. In our case this is only one buffer and it’s unlikely that it would be destroyed before execbuffer completes2
  4. Space reservation – Make sure there is enough address space in the GPU for the objects. This also includes checking for various alignment restrictions, and a few other details not really relevant to this specific topic. For our example, we’ll have to make sure we have enough space for 1 buffer of size 4096, and no special alignment requirements. It’s the second simplest request possible (first would be to have no buffers).
  5. Relocations – save for another day.
  6. Ring synchronization – Also not pertinent to gem_exec_nop. Since it involves the command streamer, I’ll include a brief description as a footnote3
  7. Dispatch – Finally we can tell the GEN hardware about the work that we just got. This means using some architectural registers to point the hardware at the batchbuffer which was submitted by userspace. More on this shortly…
  8. Some more relocation stuff – save for another day

Execution part I (Command Streamer/Ringbuffer)

Fundamentally, all work is submitted via a hardware ringbuffer, and fetched by the command streamer. A command streamer is many things, but for now, saying it’s a DMA engine for copying in commands and associated data is a good enough definition. The ringbuffer is a canonical ringbuffer with a HEAD and TAIL pointer (to be clear: TAIL is the one incremented by the CPU, and read by the GPU. HEAD is written by the GPU and read by the CPU). There is a third pointer known as ACTHD (or Active HEAD) – more on this later. At driver initialization, the space for the ringbuffer is allocated, and the address and size are written to hardware registers. When the driver wants to submit work, it writes data at the current TAIL pointer, and increments the TAIL pointer. Once the TAIL is incremented, the hardware will start reading in commands (via DMA), and increment the HEAD (and ACTHD) pointer as commands are retired.
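The HEAD/TAIL mechanics above can be modeled in a few lines. This is a pure software toy (dword-granular, power-of-two size, no wrap-space reservation or register MMIO), with the driver side bumping TAIL and a stand-in "GPU" side retiring at HEAD:

```c
#include <assert.h>
#include <stdint.h>

/* Toy software model of the hardware ring, sizes in dwords. */
#define RING_DWORDS 256              /* must be a power of two */

struct ring {
        uint32_t cmds[RING_DWORDS];
        uint32_t head;               /* advanced by the "GPU" */
        uint32_t tail;               /* advanced by the driver */
};

/* Driver side: write at TAIL, then bump TAIL (what kicks the HW). */
static void ring_emit(struct ring *r, uint32_t dword)
{
        r->cmds[r->tail] = dword;
        r->tail = (r->tail + 1) & (RING_DWORDS - 1);
}

/* "GPU" side: retire one dword, moving HEAD toward TAIL. */
static uint32_t ring_consume(struct ring *r)
{
        uint32_t dword = r->cmds[r->head];
        r->head = (r->head + 1) & (RING_DWORDS - 1);
        return dword;
}

static int ring_idle(const struct ring *r)
{
        return r->head == r->tail;   /* HEAD == TAIL: nothing to fetch */
}
```

The real driver additionally has to check for free space (HEAD chasing TAIL) and write TAIL to a hardware register, but the idle condition and the producer/consumer split are the same.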

Early GEN hardware had only 1 command streamer. It was referred to as “CS.” When Ironlake introduced the VCS, or video engine command streamer, they renamed (in some places) the original CS to RCS, for render engine command streamer. Sandybridge introduced the blit engine command streamer, BCS, and Haswell the video enhancement command streamer, or VECS. Each command streamer supports its own instruction set, though many instructions are the same on multiple command streamers; MI_NOOP is supported on all of them :P Having multiple command streamers not only provides an easy way to add new instructions, but it also allows an asynchronous way to submit work, which can be very useful if you are trying to do two orthogonal tasks. As an example, take an OpenCL application running in conjunction with your favorite 3d benchmark. The 3d benchmark internally will only use the 3d and blit hardware, while the OCL application will use the GPGPU hardware. It doesn’t make sense to have either one wait for a single command streamer to fetch the data (especially since I glossed over some other details which make it an even worse idea) if there won’t be any [or few] data dependencies.

The kernel driver is the only entity which can insert commands into the ringbuffer. The ringbuffer is therefore considered trusted, and all commands supported by the hardware may be run here (the docs use the word, “secure” but this gets confusing quickly). The way in which the batchbuffer we created in gem_exec_nop gets executed will be explained a bit further shortly, but the contents of that batchbuffer are not directly inserted into the ringbuffer4. Take a quick peek at the text in the image below for how it works.

Here is a pretty basic picture describing the above. The HEAD and TAIL point to the next instruction to be executed, therefore this would be midway through step #5 in the flowchart above.

ringbuffer

Execution part II (MI_BATCH_BUFFER_START, batchbuffer)

A batchbuffer is the way in which we can submit work to the GPU without having to write into the hardware ringbuffer (since only the kernel driver can do that). A batchbuffer is submitted to the GPU for execution via a command called MI_BATCH_BUFFER_START which is inserted into the ringbuffer and read by the command streamer. Batchbuffers share an instruction set with the command streamer that dispatched them (ie. batches run by the blit engine can issue blit commands), and the execution flow is very similar to that of the command streamer as described in the first diagram, and subsequently. On the other hand, there are quite a few differences. Batchbuffer execution is not guided by HEAD, and TAIL pointers. The hardware will continue to execute every instruction in a batchbuffer until it hits another MI_BATCH_BUFFER_START command, or an MI_BATCH_BUFFER_END. Yes, you can get into an infinite loop of batchbuffers with this nesting of MI_BATCH_BUFFER_START commands. The hardware has an internal HEAD pointer which is exposed for debug purposes called ACTHD. This pointer works exactly like a HEAD point would, except it is never compared against TAIL to determine the end of execution5. MI_BATCH_BUFFER_END will directly guide execution back to the hardware ringbuffer. In other words you need only one MI_BATCH_BUFFER_END to break the chain of n MI_BATCH_BUFFER_STARTs.

Getting back to gem_exec_nop specifically for a sec: this is what we set up in step #1. Recall it had 2 instructions MI_BATCH_BUFFER_END, MI_NOOP.

batch

Here is our graphical representation of the batchbuffer from gem_exec_nop. Notice that the batchbuffer doesn’t have a tail pointer, only ACTHD.

Hardware states

The following macro-level state machine/flowchart hybrid can be used to describe both ringbuffer execution and batchbuffer execution, though the descriptions differ slightly. By “macro-level” I mean each state may not match exactly to a state within the hardware’s state machines. It’s more of a state in the data flow. The “state machines” for both ringbuffers and batchbuffers are pretty similar. What follows is a diagram that mostly works for both, and a description of each state.

cs_state_machine

I’ll use “RSn” for ringbuffer state n, and “BSn” for batchbuffer state n.

  • RS0: Idle state, HEAD == TAIL. Waiting for driver to increment tail.
  • RS1: TAIL has changed. Fetch some amount between HEAD and TAIL (I’d guess it fetches the whole thing since the ringbuffer size is strictly limited).
  • RS2: Fetch has completed, and command parsing can begin. Command parsing here is relatively easy. Every command is 4B aligned, and has its total length embedded in the first dword of the opcode. Once it has determined the length, it can send that many dwords to the next stage.
  • RS3: 1 command has been parsed and sent to be executed (pun intended).
  • RS4: Execute phase required some more work, if the command executed in RS3 requires some extra data, now is when it will get fetched – and AFAICT, the hardware will stall waiting for the fetch to complete. If there is nothing left to do for the command, HEAD is incremented. Most commands will be done and increment HEAD. MI_BATCH_BUFFER_START is a common exception.  I wish I could easily change the image… this is really RS3.5.
  • RS5: An error state requiring a GPU reset.
  • BS0: ASSERT(last command != MI_BATCH_BUFFER_END) This isn’t a real state. While executing a batchbuffer, you’re never idle. We can use this state as a place to update ACTHD though, so let’s say ACTHD := batchbuffer start address.
  • BS1: Similar to RS1, fetch the data. Hopefully most of it exists in some internal cache since we had to fetch some amount of it in RS4, but I don’t claim to know the micro-architecture details on this.
  • BS2: Just like RS2
  • BS3: Just like RS3
  • BS4: Just like RS4
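The RS2/BS2 parse step can be sketched as a toy decoder. The length encoding here is an assumption for illustration: I take "total dwords = low byte of the first dword + 2", which at least matches the 8-byte MI_BATCH_BUFFER_START in the walkthrough below (its length field is 0, for one extra dword); the real decode rules differ per opcode class, and fixed single-dword commands like MI_NOOP are special-cased by real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Toy command parser for the RS2 stage: pull the total length out of
 * the first dword, then hand that many dwords to the execute stage.
 * Assumption: length lives in bits 7:0 as "total dwords - 2". */
static unsigned cmd_total_dwords(uint32_t first_dword)
{
        return (first_dword & 0xffu) + 2;
}

/* Step the parse position over one command, returning the new offset
 * (what a HEAD/ACTHD-style pointer would advance to). */
static unsigned parse_one(const uint32_t *cmds, unsigned head)
{
        return head + cmd_total_dwords(cmds[head]);
}
```

Running this over an MI_BATCH_BUFFER_START-style opcode with a zero length field advances the position by 2 dwords (8 bytes), which is exactly the HEAD increment in step 15 of the walkthrough.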

gem_exec_nop state walkthrough

With the above knowledge, we can now step through the actual stuff from gem_exec_nop. This combines pretty much all the diagrams above (ie. you might want to reference them). I tried to keep everything factually correct along the way, minus the address I make up below. Assume HEAD = 0x30, TAIL = 0x30, ACTHD = 0x30.

  1. Hardware is in Rs0.
  2. gem_exec_nop runs; submits previously discussed setup to i915.
  3. *** kernel picks address 0x22000 for the batchbuffer (remember I said we’re ignoring how graphics addresses work for now, so just play along)
  4. i915.ko writes 4 bytes, MI_BATCH_BUFFER_START, to the hardware ringbuffer.
  5. i915.ko writes 4 bytes, 0x22000, to the hardware ringbuffer.
  6. i915.ko increments the tail pointer by the command length (8). TAIL := 0x38
  7. RS0->RS1: DMA fetches TAIL-HEAD bytes. (0x38-0x30) = 8B
  8. RS1->RS2: DMA completes. Parsing will find that the command is MI_BATCH_BUFFER_START, and it needs 1 extra dword to proceed. This 8B command is then ready to move on.
  9. RS2->RS3: Command was successfully parsed. There is a batchbuffer to be fetched, and once that completes we need to execute it.
  10. RS3->RS4: Execution was okay, DMA fetch of the batchbuffer at 0x22000 starts… completes.
  11. RS4->BS0: ACTHD := 0x22000
  12. BS0->BS1: We’re in a batchbuffer. The commands we need to fetch are in our local cache, fetched by the ringbuffer just before, so no need to do anything more.
  13. BS1->BS2: Parsing of the batchbuffer begins. The first command pointed to by ACTHD is MI_BATCH_BUFFER_END. It is only 4B.
  14. BS2->BS3: Parse was successful. Execute the command MI_BATCH_BUFFER_END. ACTHD += 4. There are no extra requirements for this command.
  15. BS3->RS0: The batchbuffer told us to end, so we go back to the ring. Increment our HEAD pointer by the size of the last command (8B). Set ACTHD equal to HEAD. HEAD := 0x38. ACTHD := 0x38.
  16. HEAD == TAIL… we’re idle.

Summary

User space builds up a command, and a list of buffers. Then userspace tells the kernel about it via IOCTL. The kernel does some work on the command to find all the buffers and so on, then submits it to the hardware. Some time later, userspace can see the results of the commands (not discussed in detail). On the hardware side, we’ve got a ringbuffer with a head and tail pointer, a way to dispatch commands which are located sparsely in our address space, and a way to get execution back to the ringbuffer.

SVG links

https://bwidawsk.net/blog/wp-content/uploads/2014/07/gen_block_diagram.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/execbuf_5_steps.svg

Download PDF

  1. The synchronous nature of the IOCTL is something which has been discussed several times. One usage model which would really like to break that is a GPU scheduler. In the case of a scheduler, we’d want to queue up work and return to userspace as soon as possible; but that work may not yet make it into the hardware. 

  2. Buffer objects are managed with a reference count. When a buffer is created, it gets a ref count of 1, and the refcount is decremented either when the object is explicitly destroyed, or the application ceases to exist. Therefore, the only way gem_exec_nop can fail during the look up portion of execbuffer, is if the application somehow dies after creating the batchbuffer, but before calling the execbuffer IOCTL. 

  3. As I showed in the first diagram, we consider the commands executed to be “in order.” Here this means that commands are executed sequentially, and (hand waving over some caching stuff) the side effects of earlier commands are complete by the time later commands execute. This made the implicit synchronization that is baked in to the GEM API really easy to handle (the API has no way to explicitly add synchronization objects). To put this another way, if a GPU client submits a command that operates on object X, then a second command also operating on object X, they were guaranteed to execute in that order (as long as there was no race condition in userspace submitting commands). However, when you have multiple instances of the in-order command streamers, synchronization is no longer free. If a command is submitted to command streamer1 referencing object X, and then a second command is submitted to command streamer2 also referencing object X… no guarantees are made by the hardware about the order of the commands. In this case, synchronization can be achieved in two ways: hardware based semaphores, or by stalling on the second command until the first one completes.
     

  4. Certain commands which may provide security risks are not allowed to be executed by untrusted entities. If the hardware parses such a command from an untrusted entity, it will convert it into an MI_NOOP. Batchbuffers can be executed in a trusted manner, but implementing such a thing is complex.
     

  5. When the CS is executing from the ring, HEAD == ACTHD. Once the CS jumps into the batchbuffer, ACTHD will take on the address within the batchbuffer, while HEAD will remain only relevant to its position in the ring. We use this fact to help us debug whether we hung in the batch, or in the ring.
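The cross-streamer hazard in footnote 3 can be sketched as a toy model (not the real GEM code; all names are made up): within one command streamer, ordering comes for free, but a command touching an object last used on another streamer needs a semaphore or a stall.

```python
# Toy model of GEM-style implicit synchronization across command streamers.
last_streamer = {}  # object -> command streamer that last touched it

def submit(obj, streamer):
    """Return True if this submission needs explicit synchronization
    (a hardware semaphore, or stalling until the earlier command completes)."""
    prev = last_streamer.get(obj)
    needs_sync = prev is not None and prev != streamer
    last_streamer[obj] = streamer
    return needs_sync
```

Two submissions of object X to the same streamer need nothing; submitting X to a second streamer is exactly where the synchronization cost appears.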
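The HEAD/ACTHD debugging trick in footnote 5 amounts to a simple classification; here is a hedged sketch (register values and batch bounds are illustrative, not real hardware layouts):

```python
# While the CS executes from the ring, HEAD == ACTHD; once it jumps into a
# batchbuffer, ACTHD points inside the batch while HEAD stays in the ring.
def hung_location(head, acthd, batch_start, batch_end):
    """Guess whether a hang happened in the ring or in the batchbuffer."""
    if head == acthd:
        return "ring"
    if batch_start <= acthd < batch_end:
        return "batch"
    return "unknown"
```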

July 10, 2014

One feature we are spending quite a bit of effort in around the Workstation is container technologies for the desktop. This has been on the wishlist for quite some time and luckily the pieces for it are now coming together. Thanks to strong collaboration between Red Hat and Docker we have a great baseline to start from. One of the core members of the desktop engineering team, Alex Larsson, has been leading the Docker integration effort inside Red Hat and we are now preparing to build onwards on that work, using the desktop container roadmap created by Lennart Poettering.

So while Lennart's LinuxApps ideas predate Docker, they do provide a great set of steps we need to turn Docker into a container solution not just for server and web applications, but also for desktop applications. And luckily a lot of the features we need for the desktop are also useful for the other use cases; for instance, one of the main things Red Hat has been working on with our friends at Docker is integrating systemd with Docker.

There are a set of other components as part of this plan too. One of the big ones is Wayland, and I assume that if you are reading this you
have already seen my Wayland in Fedora updates.

Two other core technologies we identified are kdbus and overlayfs. Alex Larsson has already written an overlayfs backend for Docker, and Fedora Workstation Steering committee member, Josh Bowyer, just announced the availability of a Copr which includes experimental kernels for Fedora with overlayfs and kdbus enabled.

In parallel with this, David King has been prototyping a version of Cheese that can be run inside a container and that uses the concept the LinuxApps proposal calls ‘Portals’, which are basically dbus APIs for accessing resources outside the container, like the webcam and microphone in the case of Cheese. For those interested, he will be presenting his work at GUADEC at the end of the month, on Monday the 28th of July. The talk is called ‘Cheese: TNG (less libcheese, more D-Bus)’.

So all in all the pieces are really starting to come together now, and we expect to have some sessions during both GUADEC and Flock this year to try to hammer out the remaining details. If you are interested in learning more or joining the effort, be sure to check the two conferences' notice boards for the time and place of the container sessions.

There is still a lot of work to do, but I am confident we have the right team assembled to do it. In addition to the people already mentioned, we have for instance Allan Day, who is kicking off an effort to look at the user experience we want to provide around the container-hosted application bundles, in terms of upgrades and installation. And we will also work with the wider Docker community to make sure we have great composition tools for creating these container images available for developers on Fedora.

July 04, 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future, what we achieved and where we are going. In my talk I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system to being a set of basic building blocks to build an OS from. Most of the talk covered where we still intend to take systemd, which areas we believe should be covered by systemd, and of course the always difficult question of where to draw the line and what is clearly outside the focus of systemd. You can find the slides of my talk online. (No video recording I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual I had the best discussions in the hallway track, talking about Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing; we tried all kinds of fantastic stuff, from Peking duck to bullfrog Sichuan style. Yummy. One of these days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!

Update: I had actually managed to disable the VAAPI encoding in 1.2, so I just rolled a 1.3 release which re-enables it. Apart from that it is identical.

So I finally managed to put out a new Transmageddon release today. It is primarily a bugfix release, but considering how many critical bugs I ended up fixing for this release I am actually a bit embarrassed about my earlier 1.x releases. There was for instance some stupidity in my code that triggered thread safety issues, which I know hit some of my users quite badly. But there were other things not working properly either, like dropping the video stream from a file. Anyway, I know some people think that filing bugs doesn’t help, but I think I fixed every reported Transmageddon bug with this release (although not every feature request bugzilla item). So if you have issues with Transmageddon 1.2, please let me know and I will try my best to fix them. I do try to keep a policy that it is better to have limited functionality that is solid, as opposed to a lot of features that are unreliable or outright broken.

That said I couldn’t help myself so there are a few new features in this release. First of all if you have the GStreamer VAAPI plugins installed (and be sure to have the driver too) then the VAAPI GPU encoder will be used for h264 and MPEG2.

Secondly, I brought back the so-called ‘xvid’ codec (even though xvid isn’t really a separate codec, but a name used to refer to the MPEG-4 Video codec using the advanced simple profile).

So as the screenshot below shows, there are not a lot of UI changes since the last version, just some smaller layout and string fixes, but stability is hopefully greatly improved.
transmageddon-1.2

I am currently looking at a range of things as the next feature for Transmageddon including:

  • Batch transcoding, allowing you to create a series of transcoding jobs upfront instead of doing the transcodes one by one
  • Advanced settings panel, allowing you to choose which encoders to use for a given format, what profiles to use, turn deinterlacing on/off and so on
  • Profile generator, create new device profiles by inspecting existing files
  • Redo the UI to switch away from deprecated widgets

If you have any preference for which I should tackle first, feel free to let me know in the comments, and I will let popular will decide what I do first :)

P.S. I would love to have a high contrast icon for Transmageddon (HighContrast App icon guidelines) – so if there is any graphic artist out there willing to create one for me, I would be duly grateful.

July 03, 2014

As we are approaching Fedora Workstation 21 we recently held a meeting inside Red Hat to review our Wayland efforts for Fedora Workstation. Switching to a new core technology like Wayland is a major undertaking, and there are always big and small surprises along the way. So the summary is that while we expect to have a version of Wayland in Fedora Workstation 21 that will be able to run a fully functional desktop, there are some missing pieces we now know will not make it. Since we want to ship at least one Fedora release with a feature-complete Wayland as an option before making it the default, Fedora Workstation 23 is the earliest Wayland can become the default.

Anyway, here is what you can expect from Wayland in Fedora 21.

  • Wayland session available in GDM (already complete and fully working)
  • XWayland working, but without accelerated 3D (done, adding accelerated 3D will be done before FW 22)
  • Wayland session working with all free drivers (Currently only Intel working, but we expect to have NVidia and AMD support enabled before F21)
  • IBUS input working. (Using the IBUS X client. Wayland native IBUS should be ready for FW22.)
  • Touchpad acceleration working. (Last missing piece for a truly usable Wayland session, lots of work around libinput and friends currently to have it ready for F21).
  • Wacom tablets will not be ready for F21
  • 3D games should work using the Wayland backend for SDL2 (SDL1 games will need to wait for FW22 so they can use the accelerated XWayland support).
  • Binary driver support from NVidia and AMD very unlikely to be ready for F21.
  • Touch screen support working under Wayland.

We hope to have F21 testbuilds available soon that the wider community can use to help us test, because even when all the big ‘checkboxes’ are filled in there will of course be a host of smaller issues and outright bugs that need ironing out before Wayland is ready to replace X completely. We really hope the community will get involved with testing Wayland so that we can iron out all major bugs before F21.

How to get involved with the Fedora Workstation effort

To help more people get involved we recently put up a tasklist for the Fedora Workstation. It is a work in progress, but we hope that it will help more people get involved and help move the project forward.

Update: Peter Hutterer posted this blog entry explaining pointer acceleration and what we are looking at to improve it.

June 26, 2014

Hi folks,

Following up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and read back results from mesa. At the end of this project, almost all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed, and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

ioctl calls vs software methods

An ioctl (Input/Output control) is the most common hardware-controlling operation; it’s a sort of system call, available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card is processing the command stream (FIFO) and encounters an unimplemented method. PFIFO then waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this point, the kernel is inside an interrupt context; the driver then determines the method and object that caused the interrupt and implements the method. The main difference between these two approaches is that software methods can be easily synchronized with the CPU through the command stream and are context-dependent, while ioctls are unsynchronized with the command stream. With SW methods, we can make sure they are called right after the commands we want, and the following commands won’t get executed until the sw method is handled by the CPU; this is not possible with an ioctl.
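As a rough mental model of the software-method path (method names here are invented for illustration): the FIFO executes hardware methods directly, and an unimplemented method traps synchronously to a kernel handler, so nothing after it runs until the handler returns.

```python
# Toy model of PFIFO hitting an unimplemented method and trapping to the kernel.
HW_METHODS = {"draw", "blit"}  # illustrative hardware-implemented methods

def process_fifo(stream, sw_handler):
    executed = []
    for method, data in stream:
        if method in HW_METHODS:
            executed.append(method)
        else:
            # INVALID_METHOD IRQ: handled in the kernel before anything
            # later in the stream can execute.
            sw_handler(method, data)
            executed.append("sw:" + method)
    return executed
```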

Currently, I have a first prototype of that interface using a set of software methods, to get the advantage of synchronization along the command stream, but also because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context, we have to limit as much as possible the number of instructions needed to complete its task, and it’s absolutely forbidden to do a sleep call, for example.

A first prototype using software methods

Basically that interface, like NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace, and synchronize the different queries which will be sent by the userspace to the kernel. All of these operations are sent through a set of software methods.

Configure a counter

To configure a counter we will use a software method which is not currently defined, but since we can send 32 bits of data along with it, that is sufficient to identify a counter. For this, we can send the global ID of the counter, or allocate an object which represents a counter from the userspace and send its handle with that sw method. Then, the kernel pushes that counter into a staging area, waiting for the next batch of counters or for the sample command. This command can be invoked successively to add several counters. Once all the counters added by the user are known by the kernel, it’s time to send the sample command. It’s also possible to synchronize the configuration with the beginning and the end of a frame using software methods.
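A minimal sketch of that staging flow, assuming a made-up "add counter" method that carries a 32-bit counter ID and a "sample" command that commits the staged batch:

```python
# Conceptual model of the kernel-side staging area for counter configuration.
class PerfmonConfig:
    def __init__(self):
        self.staging = []   # counters added but not yet committed
        self.active = []    # counters currently being monitored

    def add_counter(self, counter_id):
        assert 0 <= counter_id < 2**32  # one 32-bit payload per sw method
        self.staging.append(counter_id)

    def sample(self):
        # the sample command commits the whole staged batch at once
        self.active, self.staging = self.staging, []
        return list(self.active)
```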

Sample a counter

This command also uses a software method which just tells the kernel to start monitoring. At this point, the kernel configures the counters (i.e. writes values to a set of special registers), then reads and stores their values, including the number of cycles processed, which may be used by the userspace to compute a ratio.

Expose counter’s data to the userspace

Currently, we can configure and sample a counter, but the result of this counting period is not yet exposed to the userspace. Basically, to be able to send results from the kernel to mesa we use a notifier buffer object which is dedicated to communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along a channel, so it is accessible both by the kernel and the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel; then we just have to map it. At this point, the kernel can write results to this BO, and the userspace can read them back. The result of a counting period is copied by the kernel to this notifier BO from another software method, which is also used to synchronize queries.

Synchronize queries with a sequence number

To synchronize queries we use a different sequence ID (like a fence) for each query we send to the kernel space. When the user wants to read out a result, it sends a query ID through a software method. This method then does the readout, copying the counter’s value to the notifier BO and the sequence number to offset 0. Also, we use a ringbuffer in the notifier BO to store the list of counter IDs, cycles and counter values. This ringbuffer is a nice way to avoid stalling command submission, and is a good fit for the gallium HUD, which queues up to 8 frames before having to read back the counters. As for the HUD, this ringbuffer stores the results of the N previous readouts. Since offset 0 stores the latest sequence ID, we can easily check whether the result is available in the ringbuffer. To check a result, we can busy-wait until the query we want is available in the ringbuffer, or we can check whether the result of that query has been overwritten by a newer one.
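The ringbuffer scheme just described can be modeled like this (slot count and layout are illustrative; "offset 0" holds the latest sequence ID):

```python
# Conceptual model of the notifier-BO ringbuffer used to synchronize queries.
N_SLOTS = 8  # the gallium HUD queues up to 8 frames

class NotifierBO:
    def __init__(self):
        self.latest_seq = 0            # "offset 0": latest sequence ID
        self.slots = [None] * N_SLOTS  # (seq, counter_id, cycles, value)

    def write_result(self, seq, counter_id, cycles, value):
        self.latest_seq = seq
        self.slots[seq % N_SLOTS] = (seq, counter_id, cycles, value)

    def read_result(self, seq):
        if seq > self.latest_seq:
            return None                # not ready yet: caller may busy-wait
        slot = self.slots[seq % N_SLOTS]
        if slot is None or slot[0] != seq:
            return None                # overwritten by a newer readout
        return slot
```

A query older than the last N readouts comes back as overwritten, which is exactly the case the userspace has to check for.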

This buffer looks like this:

schema_notifer_bo

To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events like special counter modes and multiple passes I still had to improve it.

Currently, the connection between these software methods and perfmon is still a work in progress. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my github account; you can take a look at them here. I also have an out-of-mesa example, initially written by Martin Peres, which shows how to use that first prototype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating. I will get back to you on them when I’m done experimenting.

Designing and implementing a kernel interface in an elegant way takes a while…

See you soon for the full implementation!


June 25, 2014
Firewalls

Fedora has had problems for a long while with the default firewall rules. They would make a lot of things not work (media and file sharing of various sorts, usually, whether as a client or a server) and users would usually disable the firewall altogether, or work around it through micro-management of opened ports.

We went through multiple discussions over the years trying to break the security folks' resolve on what should be allowed to be exposed on the local network (sometimes trying to get rid of the firewall). Or rather we tried to agree on a setup that would be implementable for desktop developers and usable for users, while still providing the amount of security and dependability that the security folks wanted.

The last round of discussions was more productive, and I posted the end plan on the Fedora Desktop mailing-list.

By Fedora 21, Fedora will have a firewall that's completely open for the user's applications (with better tracking of what applications do what once we have application sandboxing). This reflects how the firewall was used on the systems that the Fedora Workstation version targets. System services will still be blocked by default, except a select few such as ssh or mDNS, which might need some tightening.

But this change means that you'd be sharing your music through DLNA on the café's Wi-Fi right? Well, this is what this next change is here to avoid.

Per-network Sharing

To avoid showing your music in the café, or exposing your holiday photographs at work, we needed a way to restrict sharing to wireless networks where you'd already shared this data, and provide a way to avoid sharing in the future, should you change your mind.

Allan Day mocked up such controls in our Sharing panel, which I diligently implemented. Personal File Sharing (through gnome-user-share and WebDAV), Media Sharing (through rygel and DLNA) and Screen Sharing (through vino and VNC) implement the same per-network sharing mechanism.

Make sure that your versions of gnome-settings-daemon (which implements the starting/stopping of services based on the network) and gnome-control-center match for this all to work. You'll also need the latest version of all 3 of the aforementioned sharing utilities.

(and it also works with wired network profiles :)



Lately at Collabora I have been working on helping Mozilla with the GTK+ 3 port of Firefox.

The problem

The issue we had to solve is that GTK+ 2 and GTK+ 3 cannot be loaded in the same address space. Moving Firefox from GTK+ 2 to GTK+ 3 isn’t a problem, as only GTK+ 3 gets loaded in its address space, and everything is fine. The problem comes when you load a plugin that links to GTK+ 2, e.g. Flash. Then, GTK+ 2 and GTK+ 3 get both loaded, GTK+ detects that, and aborts to avoid bigger problems. This was tracked as bug #624422.

More specifically, Firefox links to libxul.so, which in turn links to GTK+. These days, the plugins are loaded in a separate process, plugin-container, which communicates with the Firefox process through IPC. If plugin-container didn’t link to GTK+, there would be absolutely no problem, as the browser (Firefox) process could link to GTK+ 3 and plugin-container could load any plugin, including GTK+ 2 ones. However, although plugin-container doesn’t directly use GTK+, it links to libxul.so for IPC, which brings GTK+ into its address space.

The solution

In order to solve this, we evaluated various options. The first one was to split libxul.so in two parts: one with the IPC code and lower level stuff, which wouldn’t link to GTK+, and another with the rest of the code, including all the widget and toolkit integration, which would obviously link to GTK+. However, this turned out not to be possible, as the libxul code was too intricate.

In the end, we decided to add a thin layer between libxul and GTK+, which we called libmozgtk.so. This small layer links to GTK+ 3, and provides stubs for GTK+ 2 specific symbols. Additionally, there is a libmozgtk2.so with SONAME “libmozgtk.so”, which links to GTK+ 2 and provides stubs for GTK+ 3 symbols. We made libxul link against libmozgtk.so, so when Firefox runs, libxul.so, libmozgtk.so, and GTK+ 3 are loaded, and Firefox uses GTK+ 3. However, when plugin-container is executed, we add LD_PRELOAD=libmozgtk2.so to the environment. Since libmozgtk2.so has a libmozgtk.so SONAME, the libxul.so dependency is satisfied, and the plugin-container process ends up with GTK+ 2. Since plugin-container doesn’t make use of the GTK+ code in libxul, this is safe, and we end up with a GTK+ 3 Firefox that can load GTK+ 2 plugins. The end result is that you can watch YouTube videos again!
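The SONAME trick can be modeled abstractly (a toy resolver, not the real dynamic linker): the loader satisfies a dependency by SONAME, and a preloaded library whose SONAME matches wins.

```python
# Toy model of how libmozgtk2.so (SONAME "libmozgtk.so") satisfies libxul's
# dependency when preloaded, swapping the GTK+ major version underneath.
LIBS = {
    "libmozgtk.so":  {"soname": "libmozgtk.so", "gtk": 3},
    "libmozgtk2.so": {"soname": "libmozgtk.so", "gtk": 2},
}

def resolve(needed_soname, preload=None):
    """Pick the library that satisfies a needed-SONAME entry."""
    if preload and LIBS[preload]["soname"] == needed_soname:
        return preload  # LD_PRELOAD takes precedence
    for name, lib in LIBS.items():
        if lib["soname"] == needed_soname:
            return name
    raise OSError("cannot find " + needed_soname)
```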

While this solution is somewhat hacky, it means we didn’t need to mess with libxul, splitting it in two just for the Linux/GTK+ port’s sake. And when the GTK+ 2 plugins become irrelevant, or NPAPI support is removed (as it recently happened in Chrome), we should be able to easily revert this and use GTK+ 3 everywhere.

Wayland

On an unrelated note, we have looked a bit at porting Firefox to Wayland. Wayland is designed to be a replacement for X11, and is becoming very popular in the digital TV and set top box space. Those obviously need HTML engines and web browsers, and with WebKit and Chrome already having Wayland ports, we think Firefox shouldn’t fall behind.

For this, the GTK+ 3 port was a prerequisite, but that isn’t enough. There are many X11 uses in the Firefox codebase, most of which are guarded by #ifdef MOZ_X11, though not all of them are. We got Firefox to start on Weston (the Wayland reference compositor) with a bunch of hacks, one of which broke keyboard input (but avoided a segfault). As you can see from the screenshot, things aren’t perfect, but it’s at least a good start!

Firefox running on Weston

June 23, 2014
This will, I think, be the first time blogging about something quite so retroactively, but for reasons which should be apparent, I could not blog about this little adventure until now.  This is the story of CVE-2014-0972 (QCIR-2014-00004-1), and (at least part of) how I was able to install fedora on my firetv:

Introduction..

Back in April, I bought myself a Fire TV, with the thought that it would make a nice fedora xbmc htpc setup, complete with open source drivers, to replace my aging pandaboard.  But, of course, as delivered the Fire TV is locked down with no root access.

At the same time, there was a feature of the downstream android kernel gpu driver (kgsl), per-context pagetables, which had been on my TODO list for the upstream drm/msm driver for a while now.  But, I needed to understand better what kgsl was doing and the interactions with the hardware, in particular the behaviour of the CP (command processor), in order to convince myself that such a feature was safe.  People generally frown on introducing root holes in the upstream kernel, and I didn't exactly have documentation about the hardware.  So it was time to roll up my sleeves and get some hands-on experience (translation: try to poke and crash the gpu in lots of different ways and try to make sense of the result).

Into the rabbit hole..

The modern snapdragon SoCs use IOMMUs everywhere.  Including the GPU.  To implement per-context gpu pagetables, basically all the driver needs to do is to bang a few IOMMU registers to change the pagetable base addr and invalidate the TLB.  But this must be done when you are sure the GPU is not still trying to access memory mapped in the old page tables.  Since a GPU is a highly asynchronous device, it would be a big performance hit to stall until GPU ringbuffer drains, then reprogram IOMMU, then resume the GPU with commands from the new context.  To avoid this performance hit, kgsl maps some of the IOMMU registers into the GPU's virtual address space, and emits commands into the ringbuffer for the CP to write the necessary registers to switch pagetables and invalidate TLB.

It was this reprogramming of IOMMU from the GPU itself which I needed to understand better.  Anyone who understands GPU's would have the initial reaction that this is extremely dangerous.  But kgsl was, it seemed, taking some protections.  However, I needed to be sure I properly understood how this worked, to see if there was something that was overlooked.

The GPU, in fact, has two hw contexts which it can switch between.  Essentially it is in some ways similar to supervisor vs user context on a CPU.  The way kgsl uses this is to map the IOMMU registers into the supervisor context, but not user contexts.  The ringbuffer is mapped into all the user contexts, plus supervisor context, at the same device virtual address.  The idea being that if the ringbuffer is mapped in the same position in all contexts, you can safely context switch from commands in the ringbuffer.

To do this, kgsl emits commands for the CP to write a special bit in CP_STATE_DEBUG_INDEX to switch to the "supervisor" context.  Then commands to write IOMMU registers, followed by write to CP_STATE_DEBUG_INDEX to switch back to user context.  (I'm over-simplifying slightly, as there are some barriers needed to account for asynchronous writes.)  But userspace constructed commands never execute from the ringbuffer, instead the kernel puts an IB (indirect branch) into the ringbuffer to jump to the userspace constructed cmdstream buffer.  This userspace cmdstream buffer is never mapped into supervisor context, or into other user's contexts.  So in theory, if userspace tried to write CP_STATE_DEBUG_INDEX to switch to supervisor mode (and gain access to the IOMMU registers), the GPU would immediately page fault, since the cmdstream it was in the middle of executing is no longer mapped.  Ok, so far, so good.
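The safety argument above can be reduced to a toy model (the mappings are illustrative): a buffer can only keep executing across a context switch if it is also mapped in the destination context.

```python
# Toy model of the two GPU hardware contexts and what is mapped in each.
MAPPINGS = {
    "supervisor": {"ringbuffer", "iommu_regs"},
    "user":       {"ringbuffer", "user_cmdstream"},
}

class PageFault(Exception):
    pass

def switch_context(current_buffer, new_context):
    """Switch contexts mid-buffer; fault if the buffer vanishes."""
    if current_buffer not in MAPPINGS[new_context]:
        raise PageFault(current_buffer)
    return new_context
```

From the ringbuffer, mapped everywhere, the kernel can switch safely; from a userspace cmdstream buffer the same switch should fault immediately, which is what makes the scheme look safe at first.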

Where it breaks down..

From my attempts at switching to supervisor mode from IB1, deciphering the fault address where the gpu crashed, and iommu register dumps, I could tell that the next few commands after the switch to supervisor mode were executed without problem.. there is some prefetch/pipelining!

But much more conveniently, while poking around, I realized that there were a couple of pages mapped globally (in supervisor and all user contexts), which were mapped writable in user contexts.  I used the so-called "setstate" buffer.  So I simply had to construct a cmdstream buffer to write the commands I wanted to execute into the setstate buffer, and then do an IB to that buffer and do the supervisor switch in IB2.

Ok.. but to do anything useful with this, I'd need a reasonable chunk of physically contiguous pages at a known physical address.. in particular 16K for first-level pagetables and 16K for second-level pagetables.  Fortunately ION comes to the rescue here, with its physically contiguous carveouts at known physical addresses.  In this case, allocate from the multimedia pool when there is no video playback, etc., going on.  This way ION allocates from the beginning of the carveout pool, a known address.

Into this buffer, construct a new set of pagetables which map whatever physical address you want to read/write (hint: any of kernel lowmem), plus a replacement page for the setstate buffer (since we don't know the original setstate buffer's physical address.. which means we actually have two copies of the commands copied into the setstate buffer: one copied via the gpu to the original setstate page, and one written directly by the cpu into the replacement setstate page).


The proof of concept that I made simply copied the string "Kilroy was here" into a kernel buffer.  But quite easily any random app downloaded from an untrusted source could access any memory, become root, etc.  Not the sort of thing you want falling into the wrong hands.

Once I managed to prove to myself that I understood properly how the hw was working, I wrote up a short report, and submitted it (plus proof of concept) to the qualcomm security team.

Now that the vulnerability is no longer embargoed, I've made available the proof of concept and report here.

Originally I planned to (once fixes were pushed out, so as to not put someone who did not intend to root their device at risk) release a jailbreak based on this vulnerability.  But once towelroot was released, there was no longer a need for me to turn this into an actual firetv jailbreak.  Which saves me from having to figure out how to make an apk.

Parting thoughts..

  1. Well, knowledge about physical addresses and contiguous memory in userspace, while it might not be a security problem in and of itself, sure helps turn other theoretical exploits into actual exploits.
  2. As far as downstream vendor drivers go, the kgsl driver is actually pretty decent, in terms of code quality, etc.  I've seen far worse.  Admittedly this was not a trivial hole.  But imagine what issues lurk in other downstream gpu/camera/video/etc drivers.  Security is often not simple, and I really doubt whether the other downstream drivers are getting a critical look (from good-guys who will report the issue responsibly).
  3. I used to think of the whole one-kernel-branch-per-device wild-west ways of android as a bit of a headache.  Now I realize it is a security nightmare.  An important part of platform security is being able to react quickly when (not if) vulnerabilities are found.  In the desktop/server world, CVEs are usually not embargoed for more than a week.. that is all you need, since fortunately we don't need a different kernel for each different make and model of server, laptop, etc.  In the mobile device world, it is quite a different story!

June 22, 2014
It's been a week now, and I've made surprising amounts of progress on the project.

I came in with this giant task list I'd been jotting down in Workflowy (Thanks for the emphatic recommendation of that, Qiaochu!). Each of the tasks I had were things where I'd have been perfectly unsurprised if they'd taken a week or two. Instead, I've knocked out about 5 of them, and by Friday I had phire's "hackdriver" triangle code running on a kernel with a relocations-based GEM interface. Oh, sure, the code's full of XXX comments, insecure, and synchronous, but again, a single triangle rendering in a month would have been OK with me.

I've been incredibly lucky, really -- I think I had reasonable expectations given my knowledge going in. One of the ways I'm lucky is that my new group is extremely helpful. Some of it is things like "oh, just go talk to Dom about how to set up your serial console" (turns out minicom fails hard, use gtkterm instead. Also, someone else will hand you a cable instead of having to order one, and Derek will solder you a connector. Also, we hid your precious dmesg from the console after boot, sorry), but it extends to "Let's go have a chat with Tim about how to get modesetting up and running fast." (We came up with a plan that involves understanding what the firmware does with the code I had written already, and basically whacking a register beyond that. More importantly, they handed me a git tree full of sample code for doing real modesetting, whenever I'm ready.).

But I'm also lucky that there's been this community of outsiders reverse engineering the hardware. It meant that I had this sample "hackdriver" code for drawing a triangle with the hardware entirely from userspace, that I could incrementally modify to sit on top of more and more kernel code. Each step of the way I got to just debug that one step to go from "does not render a triangle" back to "renders that one triangle." (Note: When a bug in your command validator results in pointing the framebuffer at physical address 0 and storing the clear color to it, the computer will go away and stop talking to you. Related note: When a bug in your command validator results in reading your triangle from physical address 0, you don't get a triangle. It's like I need a command validator for my command validator.)

https://github.com/anholt/linux/tree/vc4 is the code I've published so far. Starting Thursday night I've been hacking together the gallium driver. I haven't put it up yet because 1) it doesn't even initialize, but more importantly 2) I've been using freedreno as my main reference, and I need to update copyrights instead of just having my boilerplate at the top of everything. But next week I hope to be incrementally deleting parts of hackdriver's triangle code and replacing it with actual driver code.
June 20, 2014

NVIDIA NVPerfKit is a suite of performance tools to help developers identify the performance bottlenecks of OpenGL and Direct3D applications. It allows you to monitor hardware performance counters, which are used to store the counts of hardware-related activities on the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to identify bottlenecks in their applications, answering questions like “how busy is the GPU?” or “how many triangles have been drawn in the current frame?” and so on. However, NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counters to help Linux/Nouveau developers improve their OpenGL applications. By the end of this summer, this project aims to offer a Linux version of NVPerfKit for NVIDIA graphics cards (only GeForce 8, 9 and 2XX at first). To expose these hardware events to userspace, we have to write an interface between the Linux kernel and Mesa. Basically, the idea is to tell the kernel to monitor signal X and read back the results from userspace (i.e. Mesa). However, before writing that interface we have to study the behaviour of NVPerfKit on Windows.

First, let me explain (again) what a hardware performance counter really is. A hardware performance counter is a set of special registers used to count hardware-related activities. There are two types of counters: global counters from PCOUNTER, and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of a multiplexer. PCOUNTER uses global counters, whereas MP counters are per-app and context switched. Actually, these two types of counters are not fully independent and may share some configuration parts, for example the output of a signal multiplexer. On Tesla/nv50, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we are only focusing on global counters. Now, the question is: how does NVPerfKit monitor these global performance counters?

Case #1: How does NVPerfKit handle multiple apps being monitored concurrently?

NVIDIA does not handle this case at all: the behaviour is undefined when more than one application monitors performance counters at the same time. Moreover, because global counters (PCOUNTER) and local counters (MP counters) share parts of their configuration, I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, using a global lock to allow only one application at a time and to simplify the implementation.
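The global-lock idea can be pictured as a simple try-lock: the first profiling session to claim the counters wins, and later clients are refused until it releases them. A minimal user-space sketch (names are hypothetical, this is not Nouveau code):

```c
#include <stdbool.h>
#include <pthread.h>

/* Hypothetical global lock guarding the shared PCOUNTER / MP counter
 * configuration. Only one monitoring session may hold it at a time. */
static pthread_mutex_t perfmon_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if this client now owns the counters, false if another
 * session is already monitoring. */
bool perfmon_acquire(void)
{
    return pthread_mutex_trylock(&perfmon_lock) == 0;
}

void perfmon_release(void)
{
    pthread_mutex_unlock(&perfmon_lock);
}
```

In the kernel the same policy would of course use a kernel mutex or an "in use" flag on the device, but the shape of the check is the same.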

Case #2: How does NVPerfKit handle a single counter per domain?

This is the simplest case, and there are no particular requirements.

Case #3: How does NVPerfKit handle multiple counters per domain?

NVPerfKit uses a round-robin mode: it still monitors only one counter per domain, and it switches the current counter after each frame.

Case #4: How does NVPerfKit handle multiple counters on different domains?

No problem here, NVPerfKit is able to monitor multiple counters on different domains (each domain having up to one event to monitor).

To sum up, NVPerfKit always uses a round-robin mode when it has to monitor more than one hardware event on the same domain.
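The round-robin scheme is easy to picture: with N requested events on one domain but only one hardware slot, a different event is programmed each frame, cycling through the list. A toy sketch (hypothetical names, not NVPerfKit internals):

```c
/* Round-robin multiplexing of several requested events onto a single
 * hardware counter slot: advance to the next event once per frame. */
struct domain_sched {
    int nr_events; /* events requested on this domain */
    int current;   /* index of the event programmed this frame */
};

/* Called at each frame boundary; returns the index of the event to
 * program the counter with for the coming frame. */
int domain_next_event(struct domain_sched *d)
{
    d->current = (d->current + 1) % d->nr_events;
    return d->current;
}
```

The visible cost of this approach is that each event is only sampled every N frames, so per-frame numbers are really averages over the rotation period.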

Concerning the sampling part, NVIDIA says (NVPerfKit User Guide, page 11, Appendix B. Counters reference):

All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.

This article should have been published last month, but in the meantime I worked on the prototype’s definition and its implementation. Currently, I have a first prototype which works quite well; I’ll submit it next week.

See you next week!


June 18, 2014

Bartholomea annulata | (c) Kevin Bryant

It is time for a new Tanglu update, which has been overdue for a long time now!

Many things happened in Tanglu development, so here is just a short overview of what was done in the past months.

Infrastructure

Debile

The whole Tanglu distribution is now built with Debile, replacing Jenkins, which was difficult to use for package building purposes (although Jenkins is great for other things). You can see the Tanglu builders in action at buildd.tg.o.

The migration to Debile took a lot of time (a lot more than expected), and blocked the Bartholomea development at the beginning, but now it is working smoothly. Many thanks to all people who have been involved with making Debile work for Tanglu, especially Jon Severinsson. And of course many thanks to the Debile developers for helping with the integration, Sylvestre Ledru and of course Paul Tagliamonte.

Archive Server Migration

Those who read the tanglu-announce mailinglist know this already: We moved the main archive server stuff at archive.tg.o to a new location, and to a very powerful machine. We also added some additional security measures to it, to prevent attacks.

The previous machine is now being used for the bugtracker at bugs.tg.o and for some other things, including an archive mirror and the new Tanglu User Forums. See more about that below :-)

Transitions

There is huge ongoing work on package transitions. Take a look at our transition tracker and the staging migration log to get a taste of it.

Merging with Debian Unstable is also going on right now, and we are working on merging some of the Tanglu changes which are useful for Debian as well (or which just reduce the diff to Tanglu) back to their upstream packages.

Installer

Work on the Tanglu Live-Installer, although badly needed, has not yet been started (it’s a task ready for taking by anyone who likes to do it!) – however, some awesome progress has been made in making the Debian-Installer work for Tanglu, which allows us to perform minimal installations of the Tanglu base system and allows easier support of alternative Tanglu flavours. The work on d-i also uncovered a bug which appeared with the latest version of findutils, which was reported upstream before Debian could run into it. This awesome progress was possible thanks to the work of Philip Muškovac and Thomas Funk (in really hard debug sessions).

Tanglu Forums

We finally have the long-awaited Tanglu user forums ready! As discussed in the last meeting, a popular demand on IRC and our mailing lists was a forum or Stackexchange-like service for users to communicate, since many people can work better with that than with mailinglists.

Therefore, the new English TangluUsers forum is now ready at TangluUsers.org. The forum software is in an alpha version though, so we might experience some bugs which haven’t been uncovered in the testing period. We will watch how the software performs and then decide if we stick to it or maybe switch to another one. But so far, we are really happy with the Misago Forums, and our usage of it already led to the inclusion of some patches against Misago. It also is actively maintained and has an active community.

Misc Things

KDE

We will ship with at least KDE Applications 4.13, maybe some 4.14 things as well (if we are lucky, since Tanglu will likely be in feature-freeze when this stuff is released). The other KDE parts will remain on their latest version from the 4.x series. For Tanglu 3, we might update KDE SC 4.x to KDE Frameworks 5 and use Plasma 5 though.

GNOME

Due to the lack of manpower on the GNOME flavour, GNOME will ship in the same version available in Debian Sid – maybe with some stuff pulled from Experimental, where it makes sense. A GNOME flavour is planned to be available.

Common infrastructure

We currently run with systemd 208, but a switch to 210 is planned. Tanglu 2 also targets the X.org server in version 1.16. For more changes, stay tuned. The kernel release for Bartholomea is also not yet decided.

Artwork

Work on the default Tanglu 2 design has started as well – any artwork submissions are most welcome!

Tanglu joins the OIN

The Tanglu project is now a proud member (licensee) of the Open Invention Network (OIN), which builds a pool of defensive patents to protect the Linux ecosystem from companies trying to use patents against Linux. Although the Tanglu community does not fully support the generally accepting stance the OIN has on software patents, the OIN effort is very useful and we agree with its goal. Therefore, Tanglu joined the OIN as a licensee.


And that’s the stuff for now! If you have further questions, just join us on #tanglu or #tanglu-devel on Freenode, or write to our newly created forum! – You can, as always, also subscribe to our mailinglists to get in touch.

June 17, 2014

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe that if you are interested in more frequent technical updates on what we are working on.)

In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking benefit of the /usr merge that a number of distributions have completed we want to bring runtime behaviour of Linux systems to the next level. With the /usr merge completed most static vendor-supplied OS data is found exclusively in /usr, only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
  2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as a factory reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP, or are capable of discovering their parameters automatically from the available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particular interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.

Concepts

A number of Linux-based operating systems have tried to implement some of the schemes described out above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind, and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

  1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
  2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var (We will call these stateless systems.).

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like the first described mode. Note that a mode where /etc is flushed but /var is not is not something we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).

Problems

Booting up a system without a populated /var is relatively straightforward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, a lot of software does not. In cases like this it is necessary to ship a couple of additional tmpfiles lines that set up at boot time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.
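To make that concrete, a package needing a state directory under /var could ship a tmpfiles.d fragment roughly like this (the service name here is invented for illustration):

```
# /usr/lib/tmpfiles.d/myservice.conf (hypothetical example)
# Recreate the state directory at boot, should it be missing.
d /var/lib/myservice 0755 root root -
```

systemd-tmpfiles processes such fragments early at boot, so the directory exists before the service that needs it starts.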

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var there needs to be a way how these directories can be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

  • A new tool systemd-sysusers has been added. It introduces a new drop-in directory /usr/lib/sysusers.d/. Minimal descriptions of necessary system users and groups can be placed there. Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.) The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative -- which is the way system users are traditionally created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.

    To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.

    Some OS designs use a static, fixed user/group list stored in /usr as the primary database for users/groups, with fixed UID/GID mappings. While this works for specific systems, it cannot cover the general purpose. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made of this space and only the UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.

    Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.

  • We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been updated since the last boot. This is implemented based on the mtime timestamp of /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary.
  • We added a number of service files, that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementioned systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc/ld.so.cache.
  • If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.
  • We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability to copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc. Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, that always contains the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as this provides administrators with an easy place to diff their own configuration in /etc against to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd, invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that combines this into simple switches, reusing the vocabulary introduced earlier.
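The ConditionNeedsUpdate= mechanism described above is easiest to see in a unit file. A rough sketch of a post-update service (unit and binary names are invented; the condition syntax is the one systemd documents):

```
# /usr/lib/systemd/system/myservice-update.service (hypothetical example)
[Unit]
Description=Rebuild caches after an offline /usr update
# Only run when /usr is newer than /etc, i.e. after an OS update.
ConditionNeedsUpdate=/etc

[Service]
Type=oneshot
ExecStart=/usr/bin/myservice-rebuild-cache
```

On a normal boot the condition fails and the unit is skipped at no cost; after the packaging tool touches /usr, the condition passes once and the cache is rebuilt.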

What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incompatible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (simply already because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

  • When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped privileges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.
  • When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.
  • When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.)
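For the last point, a declarative sysusers.d fragment for a hypothetical daemon could look like this (the service name is invented; the line format is the one systemd-sysusers reads):

```
# /usr/lib/sysusers.d/myservice.conf (hypothetical example)
# Create system user and group "myservice" with a dynamically
# allocated UID/GID, replacing a "useradd -r" packaging script.
u myservice - "My Service Daemon" /var/lib/myservice
```

Dropping this file into /usr/lib/sysusers.d/ is all a package needs to do; the user is then created at installation time or on the next boot, whichever comes first.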

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.
  • Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.

Conclusion

Before we finish, let me stress again why we are doing all this:

  1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
  2. For embedded machines we want a generic way how to reset devices. We also want a way how every single boot can be identical to a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
  7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all discussions on how /usr was to be organized this was frequently mentioned. A setup like this so far only worked in very specific cases; with this scheme we want to make it work in the general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Let's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.

Yesterday was my first day working at Broadcom. I've taken on a new role as an open source developer there. I'm going to be working on building an MIT-licensed Mesa and kernel DRM driver for the 2708 (aka the 2835), the chip that's in the Raspberry Pi.

It's going to be a long process. What I have to work with to start is basically sample code. Talking to the engineers who wrote the code drops we've seen released from Broadcom so far, they're happy to tell me about the clever things they did (their IR is pretty cool for the target subset of their architecture they chose, and it makes instruction scheduling and register allocation *really* easy), but I've had universal encouragement so far to throw it all away and start over.

So far, I'm just beginning. I'm still working on getting a useful development environment set up and building my first bits of stub DRM code. There are a lot of open questions still as to how we'll manage the transition from having most of the graphics hardware communication managed by the VPU to having it run on the ARM (since the VPU code is a firmware blob currently, we have to be careful to figure out when it will stomp on various bits of hardware as I incrementally take over things that used to be its job).

I'll have repos up as soon as I have some code that does anything.

Overview

Pictures are the right way to start.


Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing; everything else is just there to make it as close to fact as possible.

  1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by the hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs; Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set of page tables that is maintained to have mappings identical to the global GTT (the picture above). There is more on this in the Summary section.

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straightforward. The choice is provided in several GPU commands. If there is no explicit choice, then there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or an Aliasing PPGTT1. Several other commands, like PIPE_CONTROL, have a bit as well to direct which to use for the reads or writes that the GPU command will perform.

Architecture

The names for all the page table data structures in hardware are the same as for the IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs too much, and I am probably not allowed to copy the diagrams anyway. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table. There are several good diagrams in the PRMs2 which I won’t bother redrawing.

Page Directory Entry
  Bits 31:12  Physical Page Address 31:12
  Bits 11:04  Physical Page Address 39:32
  Bits 03:02  Rsvd
  Bit  01     Page size (4K/32K)
  Bit  00     Valid

Page Table Entry
  Bits 31:12  Physical Page Address 31:12
  Bit  11     Cacheability Control[3]
  Bits 10:04  Physical Page Address 38:32
  Bits 03:01  Cacheability Control[2:0]
  Bit  00     Valid

There’s some things we can get from this for those too lazy to click on the links to the docs.

  1. PPGTT page tables exist in physical memory.
  2. PPGTT PTEs have the exact same layout as GGTT PTEs.
  3. PDEs don’t have cache attributes (more on this later).
  4. There exists support for big pages3
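Reading the PDE field layout above literally, packing a physical address into an entry is just masking and shifting. A sketch of what that looks like (illustrative only, not actual i915 code; it ignores the page-size bit):

```c
#include <stdint.h>
#include <stdbool.h>

/* Pack a physical page address into a PDE per the layout above:
 * bits 31:12 hold PA[31:12], bits 11:04 hold PA[39:32], bit 0 is
 * the Valid bit. (Hypothetical helper name; the real driver also
 * handles the 4K/32K page-size bit.) */
static uint32_t pde_encode(uint64_t pa, bool valid)
{
    uint32_t pde = (uint32_t)(pa & 0xfffff000u);  /* PA[31:12] */
    pde |= (uint32_t)((pa >> 32) & 0xffu) << 4;   /* PA[39:32] */
    if (valid)
        pde |= 1u;
    return pde;
}
```

Note how the upper physical address bits get folded into the low bits of the entry, which is what lets a 4-byte PDE address more than 4GB of physical memory.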

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

  • A PDE is 4 bytes wide
  • A PTE is 4 bytes wide
  • A Page table occupies 4k of memory.
  • There are 4k/4 entries in a page table.

With all this information, I now present you a slightly more accurate picture.

real_appgtt

An object with an aliased PPGTT mapping

Size

PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)
  63:32  MBZ
  31:0   PPGTT Directory Cache Restore [1..32] 16 entries

DCLV, the right way
  31   PDE[511:496] enable
  30   PDE[495:480] enable
  ...
  1    PDE[31:16] enable
  0    PDE[15:0] enable

The “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes, and there are 64 bytes in a cacheline, so 64/4 = 16 entries per bit. We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB

Location

PP_DIR_BASE: Sadly, I cannot find the definition of this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs (yay me). There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.

Programming

With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

...
ret = intel_ring_begin(ring, 6);
if (ret)
	return ret;

intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(2));
intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);
intel_ring_advance(ring);
...

As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about command execution in the GPU’s ringbuffer. If it’s easier, just pretend it’s 2 MMIO writes.

Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
					  &ppgtt->node, GEN6_PD_SIZE,
					  GEN6_PD_ALIGN, 0,
					  0, dev_priv->gtt.base.total,
					  DRM_MM_TOPDOWN);

2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
	if (!ppgtt->pt_pages[i]) {
		gen6_ppgtt_free(ppgtt);
		return -ENOMEM;
	}
}

3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	dma_addr_t pt_addr;

	pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
			       PCI_DMA_BIDIRECTIONAL);
	...
}

As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.

IOMMU

As we saw in step 3 above, the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU-implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the DMA address. Deferring to Wikipedia again for the description of an IOMMU – that’s all. It tripped me up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register.  Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
	ecochk |= ECOCHK_PPGTT_WB_HSW;
} else {
	ecochk |= ECOCHK_PPGTT_LLC_IVB;
	ecochk &= ~ECOCHK_PPGTT_GFDT_IVB;
}
I915_WRITE(GAM_ECOCHK, ecochk);

What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

ggtt_flow

Flow chart for GPU GGTT memory access. Red means slow.

ppgtt_flow

Flow chart for GPU PPGTT memory access. Red means slow.

The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for the PPGTT, and the LLC) backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.

 

Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable of using a per-process address space, or where doing so is undesirable.

A brief description of why, with all the current callers of the global pin interface.

  • Display: Display actually implements its own version of the GGTT. Maintaining the logic to support multi-level page tables was both costly and unnecessary. Anything related to a buffer being scanned out to the display must always be mapped into the GGTT. I expect this to be true forever.
    • i915_gem_object_pin_to_display_plane(): page flipping
    • intel_setup_overlay(): overlays
  • Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a synchronization nightmare, even if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
    • allocate_ring_buffer()
  • HW Contexts: Extremely similar to ringbuffer.
    • intel_alloc_context_page(): Ironlake RC6
    • i915_gem_create_context(): Create the default HW context
    • i915_gem_context_reset(): Re-pin the default HW context
    • do_switch(): Pin the logical context we’re switching to
  • Hardware status page: The use of this, prior to execlists, is much like ringbuffers and contexts. There is a per-process status page with execlists.
    • init_status_page()
  • Workarounds:
    • init_pipe_control(): Initialize scratch space for workarounds.
    • intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
    • render_state_alloc(): Full initialization of the GPU’s 3D state from within the kernel
  • Other
    • i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
    • i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
    • i915_gem_pin_ioctl(): The DRI1 pin interface.

GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

  • PTE size increased to 8 bytes
    • Therefore, 512 entries per table
    • Format mimics the CPU PTEs
  • PDEs increased to 8 bytes (remains 512 PDEs per PD)
    • Page Directories live in system memory
      • GGTT no longer holds the PDEs.
    • There are 4 PDPs, and therefore 4 PDs
    • PDEs are cached in LLC instead of special cache (I’m guessing)
  • New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
    • PP_DIR_BASE, and PP_DCLV are removed
  • Support for 4 level page tables, up to 48b virtual address space.
    • PML4[PML4E] -> PDP
    • PDP[PDPE] -> PD
    • PD[PDE] -> PT
    • PT[PTE] -> Memory
  • Big pages are now 64k instead of 32k (still not implemented)
  • New caching interface via PAT like structure

Summary

There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about everything mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT, are disjoint5. The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

  • The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
  • Aliasing PPGTT was designed as a drop-in performance replacement for the GGTT.
  • GEN8 changed a lot of architectural stuff.
  • The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.
https://bwidawsk.net/blog/wp-content/uploads/2014/06/appgtt_concept.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/real_ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ggtt_flow.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ppgtt_flow.svg

Download PDF

  1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT 

  2. Page walk, Two-Level Per-Process Virtual Memory 

  3. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority. 

  4. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches 

  5. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry( 

June 11, 2014
So over the past few years the drm subsystem gained some very nice documentation. And recently we've started to follow suit with the Intel graphics driver. All the kernel documentation is integrated into one big DocBook and I regularly upload the latest HTML builds of the Linux DRM Developer's Guide. This is built from drm-intel-nightly so has slightly fresher documentation (hopefully) than the usual DocBook builds from Linus' main branch which can be found all over the place. If you want to build these yourself simply run

$ make htmldocs

For testing we now also have neat documentation for the infrastructure and helper libraries found in intel-gpu-tools. The README in the i-g-t repository has detailed build instructions - gtkdoc is a bit more of a fuss to integrate.

Below the break some more details about documentation requirements relevant for developers.

So from now on I expect reasonable documentation for new, big kernel features and for new additions to the i-g-t library.

For i-g-t the process is simple: Add the gtk-doc comment blocks to all newly added functions, install and build with gtk-doc enabled. Done. If the new library is tricky (for example the pipe CRC support code) a short overview section that references some functions to get people started is useful, but not really required. And with the exception of the still in-flux kernel modesetting helper library i-g-t is fully documented, so there's lots of examples to copy from.

For the kernel this is a bit more involved, mostly since kerneldoc sucks more. But we also only just started with documenting the drm/i915 driver itself.
  1. First extract all the code for your new feature into a new file. There's unfortunately no other way to sensibly split up and group the reference documentation with kerneldoc. But at least that will also be a good excuse to review the related interfaces before extracting them.
  2. Create reference kerneldoc comments for the functions used as interfaces to the rest of the driver. It's always a bit of a judgement call what to document and what not, since compared to the DRM core, where functions must be explicitly exported to drivers, there's no clean separation between the core parts, subsystems, and the more mundane platform enabling code. For big and complicated features it's also good practice to have an overview DOC: section somewhere at the beginning of the file.
  3. Note that kerneldoc doesn't have support for markdown syntax (or anything else like that) and doesn't do automatic cross-referencing like gtk-doc. So if your documentation absolutely needs a table or a list you have to do it twice unfortunately: once as a plain code comment and once as a DocBook marked-up table or list. Long-term we want to improve the kerneldoc markup support, but for now we have to deal with what we have.
  4. As with all documentation don't document the details of the implementation - otherwise it will get stale fast because comments are often overlooked when updating code.
  5. Integrate the new kerneldoc section into the overall DRM DocBook template. Note that you can't go deeper than a section2 nesting, for otherwise the reference documentation won't be listed and, due to the lack of any autogenerated cross-links, will be inaccessible and useless. Build the html docs to check that your overview summary and reference sections have all been pulled in and that the kerneldoc parser is happy with your comments.
A really nice example for how to do this all is the documentation for the gen7 cmd parser in i915_cmd_parser.c.
June 10, 2014

videotape

Introduction

Gobi chipsets are mobile broadband modems developed by Qualcomm, and they are nowadays used by lots of different manufacturers, including Sierra Wireless, ZTE, Huawei and of course Qualcomm themselves.

These devices will usually expose several interfaces in the USB layer, and each interface will then be published to userspace as different ‘ports’ (not the correct name, but I guess easier to understand). Some of the interfaces will give access to serial ports (e.g. ttys) in the modem, which will let users execute standard connection procedures using the AT protocol and a PPP session. The main problem with using a PPP session over a serial port is that it makes it very difficult, if not totally impossible, to handle data rates above 3G, like LTE. So, in addition to these serial ports, Gobi modems also provide access to a control port (speaking the QMI protocol) and a network interface (think of it as a standard ethernet interface). The connection procedure then can be executed purely through QMI (e.g. providing APN, authentication…) and then userspace can use a much more convenient network interface for the real data communication.

For a long time, the only way to use such a QMI+net pair in the Linux kernel was to use the out-of-tree GobiNet drivers provided by Qualcomm or by other manufacturers, along with user-space tools also developed by them (some of them free/open, some of them proprietary). Luckily, a couple of years ago a new qmi_wwan driver was developed by Bjørn Mork and merged into the upstream kernel. This new driver provided access to both the QMI port and the network interface, but was much simpler than the original GobiNet one. The scope was reduced so much that most of the work the GobiNet driver was doing in kernel-space now has to be done by user-space applications. There are now at least 3 different user-space implementations allowing the use of QMI devices through the qmi_wwan port: ofono, uqmi and of course, libqmi.

The question, though, still remains. What should I use? The upstream qmi_wwan kernel driver and user-space utilities like libqmi? Or rather, the out-of-tree GobiNet driver and user-space utilities provided by manufacturers? I’m probably totally biased, but I’ll try to compare the two approaches by pointing out their main differences.

Note: you may want to read the ‘Introduction to libqmi‘ post I wrote a while ago first.

in-tree vs out-of-tree

The qmi_wwan driver is maintained within the upstream Linux kernel (in-tree). This, alone, is a huge advantage compared to GobiNet. Kernel updates may modify the internal interfaces they expose for the different drivers, and being within the same sources as all the other ones, the qmi_wwan driver will also get those updates without further effort. Whenever you install a kernel, you know you’ll have the qmi_wwan driver applicable to that same kernel version ready, so its use is very straightforward. The qmi_wwan driver also contains support for Gobi-based devices from all vendors, so regardless of whether you have a Sierra Wireless modem or a Huawei one (just to name a few), the driver will be able to make your device work as expected in the kernel.

GobiNet is a whole different story. There is not just one GobiNet: each manufacturer keeps its own. If you’re using a Sierra Wireless device you’ll likely want to use the GobiNet driver maintained by them, so that, for example, the specific VID/PID pairs are already included in the driver; or going a bit deeper, so that the driver knows which is supposed to be the QMI/WWAN interface number that should be used (different vendors have different USB interface layouts). In addition to the problem of having to look for the GobiNet driver most suitable for your device, having the drivers maintained out-of-tree means that they need to provide a single set of sources for a very long list of kernel versions. The sources, therefore, are full of #ifdefs enabling/disabling different code paths depending on the kernel version targeted, so maintaining them gets to be much more complicated than if they were just in-tree.

Note: Interestingly, we’ve already seen fixes that were first implemented in qmi_wwan ‘ported’ to GobiNet variants.

Complexity

The qmi_wwan driver is simple; it will just take a USB interface and split it into a QMI-capable /dev/cdc-wdm port (through the cdc-wdm driver) and a wwan network interface. As the kernel only provides basic transport to and from the device, managing the QMI protocol is left entirely to user-space, including service client allocations/releases as well as the whole internal CTL service. Note, though, that this is not a problem; user-space tools like libqmi will do this work nicely.

The GobiNet driver is instead very complex. The driver also exposes a control interface (e.g. /dev/qcqmi) and a network interface, but all the work that is done through the internal CTL service is done at kernel-level. So all client allocations/releases for the different services are actually performed internally, not exposed to user-space. Users will just be able to request client allocations via ioctl() calls, and client releases will be automatically managed within the kernel. In general, it is never advisable to have such a complex driver. As the complexity of a driver increases, so does the likelihood of having errors, and crashes in a driver could affect the whole kernel. Quoting Bjørn: the smaller the device driver is, the more robust the system is.

Note: Some Android devices also support QMI-capable chipsets through GobiNet (everything hidden in the kernel and the RIL). In this case, though, you may see that shared memory can also be used to talk to the QMI device, instead of a /dev/qcqmi port.

Device initialization

One of the first tasks that is done while communicating with the Gobi device is to set it up (e.g. decide which link-layer protocol to use in the network interface) and make sure that the modem is ready to talk QMI. In the case of the GobiNet driver, this is all done in kernel-space; while in the case of qmi_wwan everything can be managed in user-space. The libqmi library allows several actions to be performed during device initialization, including the setting of the link-layer protocol to use. There are, for example, models from Sierra Wireless (like the new MC7305) which expose by default one QMI+network interface (#8) configured to use 802.3 (ethernet headers) and another QMI+network interface (#10) configured to use raw IP (no ethernet headers). With libqmi, we can switch the second one to use 802.3, which is what qmi_wwan expects, thus allowing us to use both QMI+net pairs at the same time.

Multiple processes talking QMI

One of the problems of qmi_wwan is that only one process is capable of using the control port at a given time. The GobiNet driver, instead, allows multiple processes to concurrently access the device, as each process gets assigned different QMI clients with different client IDs directly from the kernel, hence not interfering with each other. In order to handle this issue, libqmi (since version 1.8) was extended to implement a ‘qmi-proxy’ process which is the only one accessing the QMI port, but which allows different processes to communicate with the device concurrently (by sharing and synchronizing the CTL service among the connected peers).

User-space libraries

The GobiNet driver is designed to be used along with Qualcomm’s C++ GobiAPI library in user-space. On top of this library, other manufacturers (like Sierra Wireless) provide additional libraries to use specific features of their devices. This GobiAPI library will handle itself all the ioctl() calls required to e.g. allocate new clients, and will also provide a high level API to access the different QMI services and operations in the device.

In the case of the qmi_wwan driver, as already said, there are several implementations which will let you talk QMI with the device. libqmi, which I maintain, is one of them. libqmi provides a GLib-based C library, and therefore it exposes objects and interfaces which provide access to the most used QMI services in any kind of device. The CTL service, the internal one which was managed in the kernel by GobiNet, will be managed internally by libqmi and therefore mostly hidden to the users of the library.

Note: It is not (yet) possible to mix GobiAPI with qmi_wwan and e.g. libqmi with GobiNet. Therefore, it is not (yet) possible to use libqmi or qmicli in e.g. an Android device with a QMI-capable chipset.

User-space command line tools

I am not really aware of any general purpose command line tool developed to be used with the GobiNet driver (well, firmware loader applications, but those are not general purpose). The lack of command line tools is likely due to the fact that, as QMI clients are released automatically by the GobiNet kernel driver, it is not easy (if at all possible) to keep a QMI client allocated and re-use it over and over by a command line tool which executes an action and exits.

With qmi_wwan, though, as clients are not automatically released, command line tools are much easier to handle. The libqmi project includes a qmicli tool which is able to execute independent QMI requests in each run of the program, even re-using the same QMI client in each of the runs if needed. This is especially important when launching a connection, as the WDS client which executes the “Start Network” command must be kept registered as long as the connection is open, or otherwise the connection will get dropped.

New firmware loading

The process of loading new firmware into a QMI-based device is not straightforward. It involves several interactions at QMI-level, plus a QDL-based download of the firmware to the device (kind of what gobi_loader does for Gobi 2K). Sadly, there is not yet a way to perform this operation when using qmi_wwan and its user-space tools. If you’re in the need of updating the firmware of the device, the only choice left is to use the GobiNet driver plus the vendor-provided programs.

Support

One of the advantages of the GobiNet driver is that every manufacturer will (should) give direct support for their devices if that kernel driver is used. Actually, there are vendors which will only give support for the hardware if their driver is the one in use. I’m therefore assuming that GobiNet may be a good choice for companies if they want to rely on the vendor-provided support, but likely not for standard users who just happen to have a device of this kind in their systems.

But, even if it is not the official support channel, you can still get in touch with the libqmi mailing list if you’re experiencing issues with your QMI device; or contact companies or individuals (e.g. me!) which provide commercial support for the qmi_wwan driver and libqmi/qmicli integration needs.


Filed under: FreeDesktop Planet, GNOME Planet, GNU Planet, Planets Tagged: Gobi, GobiNet, libqmi, linux, QMI

Two months ago, in April ’14, I was in San Francisco to meet with other FOSS developers and discuss current projects. There were several events, including the first GNOME Westcoast Summit and a systemd hackfest at Pantheon. I’ve been working on a lot of stuff lately and it was nice to talk directly to others about it. I wrote in-depth articles (on this blog) for the most interesting stories, but below is a short overview of what I focused on in SF:

  • memfd: My most important project currently is memfd. We fixed several bugs and nailed down the API. It was also nice to get feedback from a lot of different projects about interesting use-cases that we didn’t think of initially. As it turns out, file-sealing is something a lot of people can make great use of.
  • MiracleCast: For about half a year I worked on the first Open-Source implementation of Miracast. It’s still under development and only working Sink-side, but there are plans to make it work Client-side, too. Miracast allows replacing HDMI cables with wireless solutions. You can connect your monitor, TV or projector via standard wifi to your desktop and use it as a mirror or desktop-extension. The monitor is sink-side and MiracleCast can already provide a full Miracast stack for it. However, for the more interesting Source-side (e.g., a Gnome-Desktop) I had a lot of interesting discussions with Gnome developers about how to integrate it. I have some prototypes running locally, but it will definitely take a lot longer before it works properly. However, the current sink-side implementation has a latency of approx. 50ms and can run 30fps 1080p. This is already pretty impressive and on-par with proprietary solutions.
  • kdbus: The new general-purpose IPC mechanism is already fleshed out, but we spent a lot of time fixing races in the code and doing some general code review. It is a very promising project and all of the criticism I’ve heard so far was rubbish. People tend to rant about moving dbus in the kernel, even though kdbus really has nearly nothing to do with dbus, except that it provides an underlying data-bus infrastructure. Seriously, the helpers used for kernel-mode-setting without including the driver-specific code is already much bigger than kdbus… and in my opinion, kdbus will make dbus a lot more efficient and appealing to new developers.
  • GPU: GPU-switching, offload-GPUs and USB/wifi display-controllers are few of the many new features in the graphics subsystem. They’re mostly unsupported in any user-space, so we decided to change that. It’s all highly technical and the way how it is supposed to work is fairly obvious. Therefore, I will avoid discussing the details here. Lets just say, on-demand and live GPU-switching is something I’m making possible as part of GSoC this summer.
  • User-bus: This topic sounds fairly boring and technical, but it’s not. The underlying question is: What happens if you log in multiple times as the same user on the same system? Currently, a desktop system either rejects multiple logins of the same user or treats them as separate, independent logins. The second approach has the problem that many applications cannot deal with this. Many per-user resources have to be shared (like the home-directory). Firefox, for instance, cannot run multiple times for the same user. However, no-one wants to prevent multiple logins of the same user, as it really is a nice feature. Therefore, we came up with a hybrid approach which basically boils down to a single session shared across all logins of the same user. So if you log in twice, you get the same screen for both logins sharing the same applications. The window-manager can put you on a separate virtual desktop, but the underlying session is basically the same. Now if you do the same across multiple seats, you simply merge both sessions of these seats into a single huge session with the screen mirrored across all assigned monitors. A more in-depth article will follow once the details have been figured out.

A lot of the things I worked on deal with the low-level system and are hardly visible to the average Gnome user. However, without a proper system API, there’s no Gnome and I’m very happy the Gnome Foundation is acknowledging this by sponsoring my trip to SF: Thanks a lot! And hopefully I’ll see you again next year!


For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

  • ftruncate(2) to change the file size
  • read(2), write(2) and all its derivatives to inspect/modify file contents
  • mmap(2) to get a direct memory-mapping
  • dup(2) to duplicate file-descriptors

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is why the hell do we need a third way?

Two crucial differences are:

  • memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
  • There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.

To be honest, the code required for memfd_create is only about 100 lines. It didn’t take us 2 months to write those; instead, we added one more feature to memfd_create, called Sealing:

File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32-bit integer containing the bitmask of seals currently set on fd. Note that seals are per file, not per file-descriptor (nor per open file description). That means any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.

The current set of supported seals is:

  • F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set and thus do not have to be enforced.
  • F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
  • F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
  • F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.

Instead of discussing the behavior of each seal on its own, the following list shows some examples of how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn’t cover here are still accounted for, and the syscalls will fail if they violate any seals.

  • IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing all kinds of failures. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline. No attacker can alter file contents anymore. Furthermore, this also allows safe multicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you’re the last user of an object. However, that was dropped due to horrible races in the kernel. It might reoccur in the future, though.
  • Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe using memfd_create) to the server. Similar to the previous scenario, the server cannot mmap(2) that object for read-access as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

Current Status

As of June 2014, the patches for memfd_create and sealing have been publicly available for at least 2 months and are being considered for merging. linux-3.16 will probably not include it, but linux-3.17 very likely will. Currently, there are still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we’re good to go.


Linus decided to have a bit of fun with the 3.16 merge window and the 3.15 release, so I'm a bit late with our regular look at the new stuff for the Intel graphics driver.
First things first, Baytrail/Valleyview has finally gained support for MIPI DSI panels! Which means no more ugly hacks to get machines like the ASUS T100 going for users and no more promises we can't keep from developers - it landed for real this time around. Baytrail has also seen a lot of polish work in e.g. the infoframe handling, power domain reset, ...

Continuing on the new hardware platform front, this release features the first version of our preliminary support for Cherryview. At a very high level this combines a Gen8 render unit derived from Broadwell with a beefed-up Valleyview display block. So a lot of the enabling work boiled down to wiring up existing code, but of course there's also tons of new code to get all the details right. Most of the work has been done by Ville and Chon Ming Lee with lots of help from other people.

Our modeset code has also seen lots of improvements. The user-visible feature is surely the support for large cursors. On high-dpi panels 64x64 simply doesn't cut it and the kernel (and latest SNA DDX) now support up to the hardware limit of 256x256. But there have also been a lot of improvements under the hood: More of Ville's infrastructure for atomic pageflips has been merged - slowly all the required pieces like unified plane updates for modeset, two stage watermark updates or atomic sprite updates are falling into place. Still a lot of work left to do though. And the modesetting infrastructure has also seen a bit of work through the almost complete removal of the ->mode_set hooks. We need that for both atomic modeset updates and for proper runtime PM support.

On that topic: Runtime power management is now enabled for a bunch of our recent platforms - all the prep work from Paulo Zanoni and Imre Deak in the past few releases has finally paid off. There are still leftovers to be picked up over the coming releases like proper runtime PM support for DPMS on all platforms, addressing a bunch of crazy corner cases, rolling it out on the newer platforms like Cherryview or Broadwell and cleaning the code up a bit. But overall we're now ready for what the marketing people call "connected standby", which means that power consumption with all devices turned off through runtime pm should be as low as when doing a full system suspend. It crucially relies upon userspace not sucking and waking the cpu and devices up all the time, so personally I'm not sure how well this will work out really.

Another piece for proper atomic pageflip support is the universal primary plane support from Matt Roper. Based upon his DRM core work in 3.15 he now enabled the universal primary plane support in i915 properly. Unfortunately the corresponding patches for cursor support missed 3.16. The universal plane support is hence still disabled by default. For other atomic modeset work a shout-out goes to Rob Clark, whose locking conversion to wait/wound mutexes for modeset objects has been merged.

On the GEM side Chris Wilson massively improved our OOM handling. We are now much better at surviving a crash against the memory brickwall. And if we don't and indeed run out of memory we have much better data to diagnose the reason for the OOM. The top-down PDE allocator from Ben Widawsky better segregates our usage of the GTT and is one of the pieces required before we can enable full ppgtt for production use. And the command parser from Brad Volkin is required for some OpenGL and OpenCL features on Haswell. The parser itself is fully merged and ready, but the actual batch buffer copying to a secure location missed the merge window and hence it's not yet enabled in permission granting mode.

The big feature to pop the champagne though is the userptr support from Chris - after years I've finally run out of things to complain about and merged it. This allows userspace to wrap up any memory allocations obtained by malloc() (or anything else backed by normal pages) into a GEM buffer object. Useful for faster uploads and downloads in lots of situations and currently used by the DDX to wrap X shmem segments. But OpenCL also wants to use this.

We've also enabled a few Broadwell features this time around: eDRAM support from Ben, VEBOX2 support from Zhao Yakui and gpu turbo support from Ben and Deepak S.

And finally there's the usual set of improvements and polish all over the place: GPU reset improvements on gen4 from Ville, prep work for DRRS (dynamic refresh rate switching) from Vandana, tons of interrupt and especially vblank handling rework (from Paulo and Ville) and lots of other things.

In Solaris 11.1, I updated the system headers to enable use of several attributes on functions, including noreturn and printf format, giving compilers and static analyzers more information about how functions are used, so they can produce better warnings when building code.

In Solaris 11.2, I've gone back in and added one more attribute to a number of functions in the system headers: __attribute__((__deprecated__)). This is used to warn people building software that they’re using function calls we recommend no longer be used. While in many cases the Solaris Binary Compatibility Guarantee means we won't ever remove these functions from the system libraries, we still want to discourage their use.

I made passes through both the POSIX and C standards, and some of the Solaris architecture review cases to come up with an initial list which the Solaris architecture review committee accepted to start with. This set is by no means a complete list of Obsolete function interfaces, but should be a reasonable start at functions that are well documented as deprecated and seem useful to warn developers away from. More functions may be flagged in the future as they get deprecated, or if further passes are made through our existing deprecated functions to flag more of them.

| Header | Interface | Deprecated by | Alternative | Documented in |
| ------ | --------- | ------------- | ----------- | ------------- |
| <door.h> | door_cred(3C) | PSARC/2002/188 | door_ucred(3C) | door_cred(3C) |
| <kvm.h> | kvm_read(3KVM), kvm_write(3KVM) | PSARC/1995/186 | Functions on kvm_kread(3KVM) man page | kvm_read(3KVM) |
| <stdio.h> | gets(3C) | ISO C99 TC3 (removed in ISO C11), POSIX:2008/XPG7/Unix08 | fgets(3C) | gets(3C) man page, and just about every gets(3C) reference online from the past 25 years, since the Morris worm proved bad things happen when it’s used |
| <unistd.h> | vfork(2) | PSARC/2004/760, POSIX:2001/XPG6/Unix03 (removed in POSIX:2008/XPG7/Unix08) | posix_spawn(3C) | vfork(2) man page |
| <utmp.h> | All functions from getutent(3C) man page | PSARC/1999/103 | utmpx functions from getutentx(3C) man page | getutent(3C) man page |
| <varargs.h> | varargs.h version of va_list typedef | ANSI/ISO C89 standard | <stdarg.h> | varargs(3EXT) |
| <volmgt.h> | All functions | PSARC/2005/672 | hal(5) API | volmgt_check(3VOLMGT), etc. |
| <sys/nvpair.h> | nvlist_add_boolean(3NVPAIR), nvlist_lookup_boolean(3NVPAIR) | PSARC/2003/587 | nvlist_add_boolean_value, nvlist_lookup_boolean_value | nvlist_add_boolean(3NVPAIR) & (9F), nvlist_lookup_boolean(3NVPAIR) & (9F) |
| <sys/processor.h> | gethomelgroup(3C) | PSARC/2003/034 | lgrp_home(3LGRP) | gethomelgroup(3C) |
| <sys/stat_impl.h> | _fxstat, _xstat, _lxstat, _xmknod | PSARC/2009/657 | stat(2) | old functions are undocumented remains of SVR3/COFF compatibility support |

If the above table is cut off when viewing in the blog, try viewing this standalone copy of the table.

To See or Not To See

To see these warnings, you will need to be building with either gcc (versions 3.4, 4.5, 4.7, & 4.8 are available in the 11.2 package repo), or with Oracle Solaris Studio 12.4 or later (which, like Solaris 11.2, is currently in beta testing). For instance, take this oversimplified (and obviously buggy) implementation of the cat command:

#include <stdio.h>

int main(int argc, char **argv) {
    char buf[80];

    while (gets(buf) != NULL)
	puts(buf);
    return 0;
}
Compiling it with the Studio 12.4 beta compiler will produce warnings such as:
% cc -V
cc: Sun C 5.13 SunOS_i386 Beta 2014/03/11
% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221

The exact warning given varies by compiler, and the compilers also have a variety of flags to either raise the warnings to errors, or silence them. Of course, the exact form of the output is Not An Interface that can be relied on for automated parsing; it is just shown as an example.

gets(3C) is actually a special case — as noted above, it is no longer part of the C Standard Library in the C11 standard, so when compiling in C11 mode (i.e. when __STDC_VERSION__ >= 201112L), the <stdio.h> header will not provide a prototype for it, causing the compiler to complain it is unknown:

% gcc -std=c11 gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: implicit declaration of function ‘gets’ [-Wimplicit-function-declaration]
     while (gets(buf) != NULL)
     ^
The gets(3C) function of course is still in libc, so if you ignore the error or provide your own prototype, you can still build code that calls it, you just have to acknowledge you’re taking on the risk of doing so yourself.

Solaris Studio 12.4 Beta

% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221

% cc -errwarn=E_DEPRECATED_ATT gets_test.c
"gets_test.c", line 6:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221
cc: acomp failed for gets_test.c
This warning is silenced in the 12.4 beta by cc -erroff=E_DEPRECATED_ATT.
No warning is currently issued by Studio 12.3 & earlier releases.

gcc 3.4.3

% /usr/sfw/bin/gcc gets_test.c
gets_test.c: In function `main':
gets_test.c:6: warning: `gets' is deprecated (declared at /usr/include/iso/stdio_iso.h:221)

Warning is completely silenced with gcc -Wno-deprecated-declarations

gcc 4.7.3

% /usr/gcc/4.7/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]

% /usr/gcc/4.7/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

gcc 4.8.2

% /usr/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]
     while (gets(buf) != NULL)
     ^

% /usr/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
     while (gets(buf) != NULL)
     ^
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

Global Graphics Translation Tables

Here are the basics of how the GEN GPU interacts with memory. This post will focus on the lowest levels of the i915 driver, and the hardware interaction. My hope is that by going through this in excruciating detail, I might be able to take more liberties in future posts.

What is the Global Graphics Translation Table?

The graphics translation tables provide the address mapping from the GPU’s virtual address space to a physical address1. The GTT is somewhat of a relic of the AGP days (GART), with the distinction being that the GTT as it pertains to Intel GEN GPUs has logic that is contained within the GPU, and does not act as a platform IOMMU. I believe (and Wikipedia seems to agree) that GTT and GART were used interchangeably in the AGP days.

GGTT architecture

Each element within the GTT is an entry, known by the initialism “PTE,” for page table entry. Much of the required initialization is handled by the boot firmware. The i915 driver will get any required information from the initialization process via PCI config space, or MMIO.

Intel/GEN UMA system

Example illustrating Intel/GEN memory organization:

Location

The table is located within system memory, and is allocated for us by the BIOS or boot firmware. To clarify the docs a bit, GSM is the portion of stolen memory for the GTT, DSM is the rest of stolen memory used for misc things. DSM is the stolen memory referred to by the current i915 code as “stolen memory.” In theory we can get the location of the GTT from MMIO MPGFXTRK_CR_MBGSM_0_2_0_GTTMMADR (0x108100, 31:20), but we do not do that. The register space, and the GTT entries are both accessible within BAR0 (GTTMMADR).

All the information can be found in Volume 12, p.129: UNCORE_CR_GTTMMADR_0_2_0_PCI. Quoting directly from the HSW spec, “The range requires 4 MB combined for MMIO and Global GTT aperture, with 2MB of that used by MMIO and 2MB used by GTT. GTTADR will begin at GTTMMADR + 2 MB while the MMIO base address will be the same as GTTMMADR.”

In the below code you can see we take the address in the PCI BAR and add half the length to the base. For all modern GENs, this is how things are split in the BAR.

/* For Modern GENs the PTEs and register space are split in the BAR */
gtt_phys_addr = pci_resource_start(dev->pdev, 0) +
	(pci_resource_len(dev->pdev, 0) / 2);

dev_priv->gtt.gsm = ioremap_wc(gtt_phys_addr, gtt_size);

One important thing to notice above is that the PTEs are mapped in a write-combined fashion. Write combining makes sequential updates (something which is very common when mapping objects) significantly faster. Also, the observant reader might ask, ‘why go through the BAR to update the PTEs if we have the actual physical memory location?’ This is the only way we have to make sure the GPU’s TLBs get synchronized properly on PTE updates. If this weren’t required, a nice optimization might be to update all the entries at once with the CPU, and then go tell the GPU to invalidate the TLBs.

Size

Size is a bit more straightforward. We just read the relevant PCI offset. In the docs: p.151 GSA_CR_MGGC0_0_2_0_PCI offset 0x50, bits 9:8

And the code is even more straightforward.

static inline unsigned int gen6_get_total_gtt_size(u16 snb_gmch_ctl)
{
        snb_gmch_ctl >>= SNB_GMCH_GGMS_SHIFT;
        snb_gmch_ctl &= SNB_GMCH_GGMS_MASK;
        return snb_gmch_ctl << 20;
}
pci_read_config_word(dev->pdev, SNB_GMCH_CTRL, &snb_gmch_ctl);
gtt_size = gen6_get_total_gtt_size(snb_gmch_ctl);
gtt_total = (gtt_size / sizeof(gen6_gtt_pte_t)) << PAGE_SHIFT;

Layout

The PTE layout is defined by the PRM and as an example, can be found on page 35 of HSW – Volume 5: Memory Views. For convenience, I have reconstructed the important part here:

| 31:12 | 11 | 10:04 | 03:01 | 0 |
| ----- | -- | ----- | ----- | - |
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32² | Cacheability Control[2:0] | Valid |

The valid bit is always set for all GGTT PTEs. The programming notes tell us to do this (also on page 35 of HSW – Volume 5: Memory Views)3.

Putting it together

As a result of what we’ve just learned, we can make up a function to write the PTEs:

/**
 * gen_write_pte() - Write a PTE entry
 * @dev_priv:	The driver private structure
 * @address:	The physical address to back the graphics VA
 * @entry:	Which PTE in the table to update
 * @cache_type: Preformatted cache type. Varies by platform
 */
static void
gen_write_pte(struct drm_i915_private *dev_priv, phys_addr_t address,
	      unsigned int entry, uint32_t cache_type)
{
	uint32_t pte;

	/* Total size, divided by the PTE size, is the max entry */
	BUG_ON(entry >= gtt_total / 4);
	/* Addresses are limited to bits 38:0 */
	BUG_ON(address >= (1ULL << 39));

	pte = lower_32_bits(address) |
	      (upper_32_bits(address) << 4) |
	      cache_type |
	      1;
	iowrite32(pte, dev_priv->gtt.gsm + (entry * 4));
}

Example

Let’s analyze a real HSW running something. We can do this with the tool in the intel-gpu-tools suite, intel_gtt, passing it the -d option4.

GTT offset |                 PTEs
--------------------------------------------------------
  0x000000 | 0x0ee23025 0x0ee28025 0x0ee29025 0x0ee2a025
  0x004000 | 0x0ee2b025 0x0ee2c025 0x0ee2d025 0x0ee2e025
  0x008000 | 0x0ee2f025 0x0ee30025 0x0ee31025 0x0ee32025
  0x00c000 | 0x0ee33025 0x0ee34025 0x0ee35025 0x0ee36025
  0x010000 | 0x0ee37025 0x0ee13025 0x0ee1a025 0x0ee1b025
  0x014000 | 0x0ee1c025 0x0ee1d025 0x0ee1e025 0x0ee1f025
  0x018000 | 0x0ee80025 0x0ee81025 0x0ee82025 0x0ee83025
  0x01c000 | 0x0ee84025 0x0ee85025 0x0ee86025 0x0ee87025

And just to continue beating the dead horse, let’s break out the first PTE:

| 31:12 | 11 | 10:04 | 03:01 | 0 |
| ----- | -- | ----- | ----- | - |
| Physical Page Address 31:12 | Cacheability Control[3] | Physical Page Address 38:32 | Cacheability Control[2:0] | Valid |
| 0xee23000 | 0 | 0x2 | 0x2 | 1 |

Physical address: 0x20ee23000
Cache type: 0x2 (WB in LLC Only – Aged "3")
Valid: yes

Definition of a GEM BO

We refer to virtually contiguous locations which are mapped to specific graphics operands as objects, buffer objects, BOs, or GEM BOs.

In the i915 driver, the verb “bind” is used to describe the action of making a GPU virtual address range point to the valid backing pages of a buffer object.5 The driver also reuses the verb “pin” from the Linux mm to mean: prevent the object from being unbound.

bo_mapped

Example of a “bound” GPU buffer

Scratch Page

We’ve already talked about the scratch page twice, albeit briefly. There was an indirect mention, and of course in the image directly above. The scratch page is a single page allocated from memory which every unused GGTT PTE will point to.

To the best of my knowledge, the docs have never given a concrete explanation for the necessity of this; however, one might assume unintentional behavior should the GPU take a page fault. One would be right to interject at this point with the fact that by the very nature of DRI drivers, userspace can almost certainly find a way to hang the GPU. Why should we bother to protect them against this particular issue? Given that the GPU has undefined (read: not part of the behavioral specification) prefetching behavior, we cannot guarantee that even a well behaved userspace won’t invoke page faults6. Correction: after writing this, I went and looked at the docs. They do explain exactly which engines can, and cannot, take faults. The “why” seems to be missing, however.

Mappings and the aperture

The Aperture

First we need to take a bit of a diversion away from GEN graphics (which to repeat myself, are all of the shared memory type). If one thinks of traditional discrete graphics devices, there is always embedded GPU memory. This poses somewhat of an issue given that all end user applications require the CPU to run. The CPU still dispatches work to the GPU, and for cases like games, the event loop still runs on the CPU. As a result, the CPU needs to be able to both read, and write to memory that the GPU will operate on. There are two common solutions to this problem.
  • DMA engine
    • Setup overhead.
      • Need to deal with asynchronous (and possibly out of order) completion. Latencies involved with both setup and completion notification.
      • Need to actually program the interface via MMIO, or send a command to the GPU7
    • Unlikely to re-arrange or process memory
      • tile/detile surfaces8.
      • can’t take page faults, pages must be pinned
    • No size restrictions (I guess that’s implementation specific)
    • Completely asynchronous – the CPU is free to do whatever else needs doing.
  • Aperture
    • Synchronous. Not only is it slow, but the CPU has to hand hold the data transfer.
    • Size limited/limited resource. There is really no excuse with PCIe and modern 64b platforms why the aperture can’t be as large as needed, but for Intel at least, someone must be making some excuses, because 512MB is as large as it gets for now.
    • Can swizzle as needed (for various tiling formats).
    • Simple usage model. Particularly for unified memory systems.
aper_example

Moving data via the aperture

dma_example

Moving data via DMA

The Intel GEN GPUs have no local memory9. However, DMA has very similar properties to writing the backing pages directly on unified memory systems. The aperture is still used for accesses to tiled memory, and for systems without LLC. LLC is out of scope for this post.

GTT and MMAP

There are two distinct interfaces to map an object for reading or writing. There are lots of caveats to the usage of these two methods. My point isn’t to explain how to use them (libdrm is a better way to learn to use them anyway). Rather I wanted to clear up something which confused me early on.

The first is very straightforward, and has behavior I would have expected.

struct drm_i915_gem_mmap {
#define DRM_I915_GEM_MMAP       0x1e
	/** Handle for the object being mapped. */
	__u32 handle;
	__u32 pad;
	/** Offset in the object to map. */
	__u64 offset;
	/**
	 * Length of data to map.
	 *
	 * The value will be page-aligned.
	 */
	__u64 size;
	/**
	 * Returned pointer the data was mapped at.
	 *
	 * This is a fixed-size type for 32/64 compatibility.
	 */
	__u64 addr_ptr;
};

// let bo_handle = some valid GEM BO handle to a 4k object
// What follows is a way to map the BO, and write something
struct drm_i915_gem_mmap arg;

memset(&arg, 0, sizeof(arg));
arg.handle = bo_handle;
arg.offset = 0;
arg.size = 4096;
ioctl(fd, DRM_IOCTL_I915_GEM_MMAP, &arg);
*((uint32_t *)(uintptr_t)arg.addr_ptr) = 0xdefeca7e;

I might be projecting my ineptitude on the reader, but it’s the second interface which caused me a lot of confusion, and one which I’ll talk briefly about. The interface itself is even smaller:

#define DRM_I915_GEM_MMAP_GTT   0x24
struct drm_i915_gem_mmap_gtt {
	/** Handle for the object being mapped. */
	__u32 handle;
	__u32 pad;
	/**
	 * Fake offset to use for subsequent mmap call
	 *
	 * This is a fixed-sizeso [sic] type for 32/64 compatibility.
	 */
	__u64 offset;
};

Why do I think this is confusing? The name itself never quite made sense – what use is there in mapping an object to the GTT? Furthermore, how does mapping it to the GPU allow me to do anything with it from userspace? For one thing, I had confused “mmap” with “map.” The former really does identify the recipient (the CPU, not the GPU) of the mapping. It follows the conventional use of mmap(). The other thing is that the interface has an implicit meaning. A GTT map here actually means a GTT mapping within the aperture space. Recall that the aperture is a subset of the GTT which can be accessed through a PCI BAR. Therefore, what this interface actually does is return a token to userspace which can be mmap’d to get the CPU mapping (through the BAR, to the GPU memory). Like I said before, there are a lot of caveats in the decision to use one vs. the other, which depend on the platform, the type of surface you are operating on, and available aperture space at the time of the call. All of these things will not be discussed.

Conceptualized view of mmap and mmap_gtt

Conceptualized view of mmap and mmap_gtt

Finally, here is a snippet of code from intel-gpu-tools that hopefully just encapsulates what I said and drew.

mmap_arg.handle = handle;
assert(drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &mmap_arg) == 0);
assert(mmap64(0, OBJECT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_arg.offset));

Summary

This is how modern Intel GPUs deal with system memory on all platforms without a PPGTT (or if you disable it via module parameter). Although I happily skipped over the parts about tiling, fences, and cache coherency, rest assured that if you understood all of this post, you have a good footing. Going over the HSW docs again for this post, I am really pleased with how much Intel has improved the organization, and clarity. I highly encourage you to go off and read those for any missing pieces.

Please let me know about any bugs, or feature requests in this post. I would be happy to add them as time allows.

Here are links to SVGs of all the images I created. Feel free to use them how you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/06/overview_standard.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/bo_mapped.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/dma_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/aper_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/mmaps.svg


  1. when using VT-d, the address is actually an I/O address rather than the physical address 

  2. Previous gens went to 39 

  3. I have submitted two patch series, one of which has been reverted, the other, never merged, which allow invalid PTEs for debug purposes 

  4. intel_gtt is currently not supported for GEN8+. If someone wants to volunteer to update this tool for gen8, please let me know 

  5. I’ve fought to call this operation, “map” 

  6. Empirically (for me), GEN7+ GPUs have behaved themselves quite well after taking the page fault. I very much believe we should be using this feature as much as possible to help userspace driver developers 

  7. I’ve previously written a post on how this works for Intel 

  8. Sorry people, this one is too far out of scope for an explanation in this post. Just trust that it’s a limitation if you don’t understand. Daniel Vetter probably wrote an article about it if you feel like heading over to his blog

  9. There are several distinct caches on all modern GEN GPUs, as well as eDRAM for Intel’s Iris Pro. The combined amount of this “local” memory is actually greater than many earlier discrete GPUs 

June 05, 2014

I don’t know if I’ve ever eaten my own dogfood that smells this risky.

A few days ago, I published patches to support dynamic page table allocation and tear-down in the i915 driver http://lists.freedesktop.org/archives/intel-gfx/2014-March/041814.html. This work will eventually help us support expanded page tables (similar to how things work for normal Linux page tables). The patches rely on using full PPGTT support, which still requires some work to get enabled by default. As a result, I’ll be carrying around this work for quite a while. The patches provide a lot of opportunity to uncover all sorts of weird bugs we’ve never seen due to the more stressful usage of the GPU’s TLBs. To avoid the patches getting too stale, and to further the bug extermination, I figured: why not run it myself?

If you feel like some serious pain, or just want to help me debug it, give it a go – there should be absolutely no visible gain for you, only harm. You can either grab the patches from the mailing list, patchwork, or my branch.  Make sure to turn on full PPGTT support with i915.enable_ppgtt=2. If you do decide to opt for the pain, you can take comfort in the fact that you’re helping get the next big piece of prep work in place.

The question is, how long before I get sick of this terrible dogfood? I’m thinking by Monday I’ll be finished :D

This is a short and vague glimpse of the interfaces that the Linux kernel offers to user space for display and graphics management, from the history to what is hot and new, to what might perhaps be coming after. The topic became topical for me when I started preparing Weston for global thermonuclear war.

The pre-history


In the age of dragons, kernel mode setting did not exist. There was only user space mode setting, where the job of the kernel driver (if any) was simply to give user space direct access to the graphics card registers. A user space driver (well, Xorg video DDX, really, err... or what it was at the time of XFree86) would then poke the card registers to set a mode. The kernel had no idea of anything.

The kernel DRM infrastructure was started as an out-of-tree kernel module to coordinate between multiple programs wanting to access the graphics card's resources. Later it was (partially?) merged into the kernel tree (the year is a lie, 2.3.18 came out in 1999), and much much later it was finally deleted from the libdrm repository.

The middle age


For some time, the kernel DRM existed alongside user space mode setting. It was a dark time full of crazy hacks to keep it all together with duct tape, barbwire and luck. GPUs and hardware accelerated OpenGL started to come up.

The new age


With the advent of kernel mode setting (KMS), the DRM kernel drivers took charge of the graphics card resources: outputs, video modes, memory allocations, hotplug! User space mode setting became obsolete and was eventually killed. The kernel driver was finally actually in control of the graphics hardware.

KMS probably started with just setting the main framebuffer (primary plane) for each "CRTC" and programming the video mode. CRTC is short for "cathode-ray tube controller", but essentially means a block that reads memory (a framebuffer) and produces a bitstream according to video mode timings. The bitstream is directed into an "encoder", which turns it into a proper physical/analogue signal, like VGA or digital DVI. The signal then exits the graphics card through a "connector". CRTC, encoder, and connector are the basic concepts of the KMS API. Quite often these can be combined in some restricted ways, like a single CRTC feeding two encoders for clone mode.

Even ancient hardware supported hardware cursors: a small sprite that was composited into the outgoing video signal on the fly, which meant that it was very cheap to move around. The cursor, being so special, and often having a funny color format (alpha!), got its very own DRM ioctl.

There were also hardware overlays (additional or secondary planes) on some hardware. While the primary framebuffer covers the whole display, an overlay is another buffer (just like the cursor) that gets mixed into the bitstream at the CRTC level. It is like basic compositing done on the scanout hardware level. Overlays usually had additional benefits, for example they could apply scaling or color space conversion (hello, video players) very efficiently. Overlays being different, they too got their very own DRM ioctls.

The KMS user space ABI was anything but atomic. With the X11 tradition, it wasn't too important how to update the displays, as long as the end result eventually was what you wanted. Race conditions in content updates didn't matter too much either, as X was racy as hell anyway. You update the CRTC. Then you update each overlay. You might update the cursor, too. By luck, all these updates could hit the same vblank. Or not. Or you don't hit vblank at all, and get tearing. No big deal, as X was essentially all about front-buffer rendering anyway. (And then there were huge efforts in trying to fix it all up with X, GLX, Mesa and GL-compositors, and avoid tearing, and it ended up complicated.)

With the advent of X compositing managers, which did not play well with the awkward X11 protocol (Xv) or the hardware overlays, and with the rise of GPU power and OpenGL, it was thought that hardware overlays would eventually die out. It turned out the benefits of hardware overlays were too great to abandon, and with Wayland we again have a decent chance to make the most of them while still enjoying compositing.

The global thermonuclear war (named after a git branch by Rob Clark)


The quality of display updates became important. People do not like tearing. Someone actually wanted to update the primary framebuffer and the overlays on the same vblank, guaranteed. And the cursor as the cherry on top.

We needed one ABI to rule them all.

Universal planes bring framebuffers (primary planes), overlays (secondary planes) and cursors (cursor planes) together under the same API. No more type-specific ioctls, but common ioctls shared by them all. As these objects are still somewhat different, with overlays having wildly differing features and vendors wanting to expose their own stuff, object properties were invented.

An object property is essentially a {key, value} pair. In the API, the name of a key is a string. Each object has its own set of keys. To use a key, you must know it by name, fetch the handle, and then use the handle when setting the value. Handles seem to be per-object, so make sure to fetch them separately for each.

Atomic mode setting and nuclear pageflip are two sides of the same feature. Atomicity is achieved by gathering a set of property changes, and then pushing them all into the kernel in a single ioctl call. Then that call either succeeds or fails as a whole. Libdrm offers a drmModePropertySet for gathering the changes. Everything is exposed as properties: the attached FB, overlay position, video mode, etc.

Atomic mode setting means setting the output modes of a single graphics device, more or less. Devices may have hard-to-express limitations. A simple example is the available scanout memory bandwidth: You can drive either two mid-resolution outputs, or one high-resolution output. Or maybe some crtc-encoder-connector combination is not possible with a particular other combination for another output. Collecting the video mode, encoder and connector setup over the whole graphics card into a single operation avoids flicker. Either the whole set succeeds, or it fails. Without atomic mode setting, changing multiple outputs would not only take longer, but if some step failed, you'd have to undo all earlier steps (and hope the undo steps don't fail). Plus, there would be no way to easily test if a certain combination is possible. Atomic mode setting fixes all this.

Nuclear pageflip is about synchronizing the update of a single output (monitor) and making that atomic. This means that when user space wants to update the primary framebuffer, move the cursor, and update a couple of overlays, all those changes happen at the same vblank. Again it all either succeeds or fails. "Every frame is perfect."

And then there shall be ponies (at the end of the rainbow)


Once the global thermonuclear war is over, we have the perfect ABI for driving display updates.

Well, almost. Enter NVidia G-Sync, or AMD's FreeSync which is actually backed by a VESA standard. Dynamically variable refresh rate. We have no way yet for timing display updates in DRM. All we can do is kick out a display update, and it will hopefully land on the next vblank, whenever that is. But we can't tell the DRM when we would like it to be. Everything so far assumes that the display refresh rate is a constant, apart from an explicit mode switch. Though I have heard that e.g. Chrome for Intel (i915, LVDS/eDP reclocking) has some hacks that opportunistically drop the refresh rate to save power.

There is also a catch in the DRM of today (Jun 3rd, 2014). You can schedule a pageflip, but if you have pending rendering on that framebuffer on the same GPU as the one presenting it, the pageflip will not happen until the rendering completes. And you do not know when it will complete, which means you do not know if you will hit the very next vblank or something later.

If the rendering GPU is not the same graphics device that presents the framebuffer, you do not get synchronization at all. That means that you may be scanning out an incomplete rendering for a frame or two, or you have to stall the GPU to make sure it is done before scheduling the page flip. This should be fixed with the fences related to dma-bufs (Hi, Maarten Lankhorst).

And so the unicorn keeps on running.
May 30, 2014

Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up. I've been there mainly to discuss Ceilometer upcoming developments.

The summit has been great. It was my third OpenStack design summit, and the first one not being a PTL, meaning it was a largely more relaxed summit for me!

On Monday, we started with a 2.5-hour meeting between Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week.

Ceilometer had its design sessions running mainly on Wednesday. We noted a lot of things and commented during the sessions in our Etherpad instances. Here is a short summary of the sessions I attended.

Scaling the central agent

I was in charge of the first session, and introduced the work that was done so far in the scaling of the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the tooz library, which we worked on at eNovance during the last 6 months.

Now that we have this foundation available, Cyril Roelandt started to replace the Ceilometer alarming job repartition code with Taskflow and Tooz. Starting with the alarm evaluators is simpler, and will be a first proof of concept that can then be reused for the central agent. We plan to get this merged for Juno.

For the central agent, the same work needs to be done, but since it's a bit more complicated, it will be done after the alarming evaluators are converted.

Test strategy

The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot in this area to be done, and this is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we're still not there yet.

Complex queries and per-user/project data collection

This session, led by Ildikó Váncsa, was about adding finer-grained configuration into the pipeline configuration to allow per-user and per-project data retrieval. The idea was well received and not really controversial, though exactly how to implement it is still to be discussed. The other part of the session was about extending the complex queries feature provided by the v2 API.

Rethinking Ceilometer as a Time-Series-as-a-Service

This was my main session, the reason we met on Monday for a few hours, and one of the most promising sessions – I hope – of the week.

It appears that the way Ceilometer designed its API and storage backends a long time ago is now a problem for scaling the data storage. Also, the events API we introduced in the last release partially overlaps some of the functionality provided by the samples API, which causes us scaling troubles.

Therefore, I've started to rethink the Ceilometer API by building it as a time series read/write service, leaving the audit part of our previous sample API to the event subsystem. After some research and experimentation, I've designed a new project called Gnocchi, which provides exactly that functionality in a hopefully scalable way.

Gnocchi is split in two parts: a time series API and its driver, and a resource indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time series handling is based on Pandas and Swift. The canonical resource indexer driver is based on SQLAlchemy.

The idea and project was well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.

Revisiting the Ceilometer data model

This session led by Alexei Kornienko, kind of echoed the previous session, as it clearly also tried to address the Ceilometer scalability issue, but in a different way.

Anyway, the SQL driver limitations have been discussed and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.

Ceilometer devops session

We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting; the list of things we could improve is long, and I think it will help us drive our future efforts.

SNMP inspectors

This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.

Alarm and logs improvements

This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements on the alarm evaluation system provided by Ceilometer, and making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.

Conclusion

Considering the current QA problems with Ceilometer, Eoghan Glynn, the new Project Technical Leader for Ceilometer, clearly indicated that this will be the main focus of the release cycle.

Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the next weeks. Our idea is to develop a complete solution with high velocity in the next weeks, and then work on its integration with Ceilometer itself.

May 29, 2014

I spent last weekend in Beijing attending GNOME Asia 2014; yeah, long trip from Europe just for 3 days, but it was totally worth it. The worst part of it was of course fighting jet lag when I arrived, and fighting it again 3 days later when I came back to Spain :) The conference was really well organized [1], so kudos to all the local team!

After a quick sleep on Friday morning, I attended the development and documentation training sessions that Kat, André and Dave gave. They were quite interesting, especially since I’m not involved in the real user documentation that GNOME provides. I have to say that these guys do an amazing job, not only teaching during conferences, but also through the whole year.

There are, from my point of view, two main ways of learning new things:

  • The ‘engineer’ way: Learning things as you need them, which is what you do when you start writing an application and look for examples of how to do what you want to do (autotools, anyone?). It is a very ‘engineer’ way, as you pick black boxes that you’ll use to build something bigger, while not fully understanding what the black box does inside.
  • The ‘scientific’ way: When you learn something in order to fully understand it and be able to teach others. This approach takes a lot longer, as you need to make sure that everything you learn is accurate and you end up questioning the things that are not clear enough. Learning stuff to teach others is actually what you do in University; you’re learning things that will afterwards need to be explained in an exam to someone who knows more about the subject than you do.

Sure, both ways have their ups and downs, but if you want to write software you need to be able to switch between those two mindsets constantly. You’ll use the ‘engineer’ way when reading API docs, looking for the bits and pieces that you need to build your stuff. You’ll use the ‘scientific’ way when you need to start learning a new technology, or when you need more detail on how to do things. While the API docs are taken care of by the library developers, it is the documentation team who make sure that user guides, tutorials, and other developer resources are kept up to date, which is definitely one of the toughest and most important tasks done to help newcomers and other developers. So go on, go learn GNOME technologies and teach others, join the documentation team! ;)

GNOME Asia is not a usual conference. If you have attended a Desktop Summit, GUADEC or FOSDEM before, all those conferences are built by developers and for developers. The focus of those conferences is usually not (explicitly) to attract newcomers, but instead to be a show of the latest and shiniest things happening in the projects. Of course we also had part of that in Beijing, say Lennart’s talk about the status of systemd or Allan’s talk about application bundles. Those both were very good talks, but likely too specific for most of the audience. Instead, I chose to talk about something more basic, focused on attracting newcomers wanting to write applications, and so I gave an Introduction to D-Bus talk, including some examples. It is the same talk I gave last year in GUADEC-ES, but in English this time (my Mandarin is not good enough).

I would like to thank the GNOME Foundation for sponsoring the flight to Beijing, and of course to all the local team who, as I already said, did an amazing job.


[1] …except for the tea-less tea-breaks ;)



Recently Philip decided it was time to call for some attention. I happen to agree with him on the need to focus on developer experience; that's why I organized the first hackfest on this topic last year and attended this year's. There are plenty of conversations around this, and Philip, if you care so much, maybe you could attend or help; there's a lot to do and so few hands.

I've been asked to remove your blog by several people and I've reached the conclusion that it would be a really bad idea, because it would set the wrong precedent and it would shift the discussion to the wrong topic (censorship yadda yadda). Questioning OPW should be allowed. The problem with your post is that, if not questioned by other people (as many have done already), it would send the wrong message to the public and to prospective GSoC, OPW and general contributors. Your blog was the wrong place to question it, and your wording makes it clear that you have misunderstandings about how the community works.

You want to make things better? Why don't you start by learning how to work with others and contributing yourself? You think we need better leadership? Why don't you learn what it takes to become a leader? (hint: your blog post doesn't help)

Perhaps your lack of contact with the overall project and your absence from most events make you not realize how positive OPW has been. OPW has been a lot more successful than GSoC in retaining contributors and bringing diversity to our contributor base (and I don't mean gender diversity, but diversity in the nature of those contributions). I happen to have a pretty good picture of this because I get to manage the blogs of the people who stay and the people who leave. Without OPW, GNOME would be worse community-wise and project-wise, and this is not an opinion, it is a _hard data_ backed fact (other posts have enumerated the contributions that would not have happened otherwise, so I will not do that here).

There are plenty of questions that I think are healthy to ask: for how long do we do OPW? Is its success only due to it being targeted at women, or is it successful for some other reason? You should have a conversation with Marina and other people involved with OPW and gather an understanding before making assumptions and throwing out assertions. And you should respect what people choose to do within the project; it's their goddamn time after all. In open source no one gets to dictate what anybody does (though alignment is always good if it can be achieved); people work on what they think is important and they try to do it together.

I think you should also watch this video, it might give you some understanding on why GNOME is as responsible for equality as any other entity.

May 28, 2014

A common error when building from source is something like the error below:


configure: error: Package requirements (foo) were not met:

No package 'foo' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.
Seeing that can be quite discouraging, but luckily, in many cases it's not too difficult to fix. As usual, there are many ways to get to a successful result; I'll describe what I consider the simplest.

What does it mean?

pkg-config is a tool that provides compiler flags, library dependencies and a couple of other things to correctly link to external libraries. For more details on it see Dan Nicholson's guide. If a build system requires a package foo, pkg-config searches for a file foo.pc in the following directories: /usr/lib/pkgconfig, /usr/lib64/pkgconfig, /usr/share/pkgconfig, /usr/local/lib/pkgconfig, /usr/local/share/pkgconfig. The error message simply means pkg-config couldn't find the file and you need to install the matching package from your distribution or from source.

What package provides the foo.pc file?

In many cases the package is the development version of the package name. Try foo-devel (Fedora, RHEL, SuSE, ...) or foo-dev (Debian, Ubuntu, ...). yum provides a great shortcut to install any pkg-config dependency:


$> yum install "pkgconfig(foo)"
will automatically search and install the right package, including its dependencies.
apt-get requires a bit more effort:

$> apt-get install apt-file
$> apt-file update
$> apt-file search --package-only foo.pc
foo-dev
$> apt-get install foo-dev
For those running Arch and pacman, the sequence is:

$> pacman -S pkgfile
$> pkgfile -u
$> pkgfile foo.pc
extra/foo
$> pacman -S extra/foo
zypper is the same as yum:

$> zypper in 'pkgconfig(foo)'
Once that's done you can re-run configure and see if all dependencies have been met. If more packages are missing, follow the same process for the next file.

Any users of other distributions - let me know how to do this on yours and I'll update the post

Where does the dependency come from?

In most projects using autotools the dependency is specified in the file configure.ac and looks roughly like one of these:


PKG_CHECK_MODULES(FOO, [foo])
PKG_CHECK_MODULES(DEPENDENCIES, foo [bar >= 1.4] banana)
The first argument is simply a name that is used in the build system; you can ignore it. After the comma is the list of space-separated dependencies. In this case this means we need foo.pc, bar.pc and banana.pc, and more specifically we need a bar.pc that is equal to or newer than version 1.4 of the package. To install all three, follow the above steps and you're good.

My version is wrong!

It's not uncommon to see the following error after installing the right package:


configure: error: Package requirements (foo >= 1.9) were not met:

Requested 'foo >= 1.9' but version of foo is 1.8

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.
Now you're stuck and you have a problem. What this means is that the package version your distribution provides is not new enough to build your software. This is where the simple solutions end and it all gets a bit more complicated - with more potential errors. Unless you are willing to go into the deep end, I recommend moving on and accepting that you can't have the newest bits on an older distribution. Because now you have to build the dependencies from source, and that may then require building their dependencies from source, and before you know it you've built 30 packages. If you're willing, read on; otherwise - sorry, you won't be able to run your software today.

Manually installing dependencies

Now you're in the deep end, so be aware that you may see more complicated errors in the process. First of all you need to figure out where to get the source from. I'll now use cairo as example instead of foo so you see actual data. On rpm-based distributions like Fedora run:


$> yum info cairo-devel
Loaded plugins: auto-update-debuginfo, langpacks
Skipping unreadable repository '///etc/yum.repos.d/SpiderOak-stable.repo'
Installed Packages
Name : cairo-devel
Arch : x86_64
Version : 1.13.1
Release : 0.1.git337ab1f.fc20
Size : 2.4 M
Repo : installed
From repo : fedora
Summary : Development files for cairo
URL : http://cairographics.org
License : LGPLv2 or MPLv1.1
Description : Cairo is a 2D graphics library designed to provide high-quality
: display and print output.
:
: This package contains libraries, header files and developer
: documentation needed for developing software which uses the cairo
: graphics library.
The important field here is the URL line - go to that and you'll find the source tarballs. That should be true for most projects but you may need to google for the package name and hope. Search for the tarball with the right version number and download it. On Debian and related distributions, cairo is provided by the libcairo2-dev package. Run apt-cache show on that package:

$> apt-cache show libcairo2-dev
Package: libcairo2-dev
Source: cairo
Version: 1.12.2-3
Installed-Size: 2766
Maintainer: Dave Beckett
Architecture: amd64
Provides: libcairo-dev
Depends: libcairo2 (= 1.12.2-3), libcairo-gobject2 (= 1.12.2-3),[...]
Suggests: libcairo2-doc
Description-en: Development files for the Cairo 2D graphics library
Cairo is a multi-platform library providing anti-aliased
vector-based rendering for multiple target backends.
.
This package contains the development libraries, header files needed by
programs that want to compile with Cairo.
Homepage: http://cairographics.org/
Description-md5: 07fe86d11452aa2efc887db335b46f58
Tag: devel::library, role::devel-lib, uitoolkit::gtk
Section: libdevel
Priority: optional
Filename: pool/main/c/cairo/libcairo2-dev_1.12.2-3_amd64.deb
Size: 1160286
MD5sum: e29852ae8e8e5510b00b13dbc201ce66
SHA1: 2ed3534d02c01b8d10b13748c3a02820d10962cf
SHA256: a6099cfbcc6bd891e347dd9abc57b7f137e0fd619deaff39606fd58f0cc60d27
In this case it's the Homepage line that matters, but the process of downloading tarballs is the same as above. For Arch users, the interesting line is URL as well:

$> pacman -Si cairo | grep URL
Repository : extra
Name : cairo
Version : 1.12.16-1
Description : Cairo vector graphics library
Architecture : x86_64
URL : http://cairographics.org/
Licenses : LGPL MPL
....
zypper (Tizen, SailfishOS, Meego and others) doesn't have an interface for this, but you can run rpm on the package that you installed.

$> rpm -qi cairo-devel
Name : cairo-devel
[...]
URL : http://cairographics.org/
This command would obviously work on other rpm-based distributions too (Fedora, RHEL, ...). Unlike yum, it does require the package to be installed but by the time you get here you've already installed it anyway :)

Now to the complicated bit: In most cases, you shouldn't install the new version over the system version because you may break other things. You're better off installing the dependency into a custom folder ("prefix") and point pkg-config to it. So let's say you downloaded the cairo tarball, now you need to run:


$> mkdir $HOME/dependencies/
$> tar xf cairo-someversion.tar.xz
$> cd cairo-someversion
$> autoreconf -ivf
$> ./configure --prefix=$HOME/dependencies
$> make && make install
$> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
# now go back to original project and run configure again
So you create a directory called dependencies, and install cairo there. This will install cairo.pc as $HOME/dependencies/lib/pkgconfig/cairo.pc. Now all you need to do is tell pkg-config that you want it to look there as well - so you set PKG_CONFIG_PATH. If you re-run configure in the original project, pkg-config will find the new version and configure should succeed. If you have multiple packages that all require a newer version, install them into the same path and you only need to set PKG_CONFIG_PATH once. Remember you need to set PKG_CONFIG_PATH in the same shell as you are running configure from.

If you keep seeing the version error the most common problem is that PKG_CONFIG_PATH isn't set in your shell, or doesn't point to the new cairo.pc file. A simple way to check is:


$> pkg-config --modversion cairo
1.13.1
Is the version number the one you installed or the system one? If it is the system one, you have a typo in PKG_CONFIG_PATH, just re-set it. If it still doesn't work do this:

$> cat $HOME/dependencies/lib/pkgconfig/cairo.pc
prefix=/home/user/dependencies
exec_prefix=${prefix}
libdir=${prefix}/lib
includedir=${prefix}/include

Name: cairo
Description: Multi-platform 2D graphics library
Version: 1.13.1

Requires.private: gobject-2.0 glib-2.0 >= 2.14 [...]
Libs: -L${libdir} -lcairo
Libs.private: -lz -lz -lGL
Cflags: -I${includedir}/cairo
If the Version field matches what pkg-config returns, then you're set. If not, keep adjusting PKG_CONFIG_PATH until it works. There is a rare case where the Version field in the installed library doesn't match what the tarball said. That's a defective tarball and you should report this to the project, but don't worry, this hardly ever happens. In almost all cases, the cause is simply PKG_CONFIG_PATH not being set correctly. Keep trying :)

Let's assume you've managed to build the dependencies and want to run the newly built project. The only problem is: because you built against a newer library than the one on your system, you need to point it to use the new libraries.


$> export LD_LIBRARY_PATH=$HOME/dependencies/lib
and now you can, in the same shell, run your project.

Good luck!

May 26, 2014

And again, another KDE-AppStream post ;-) If you want to know more about AppStream metadata and why adding it to your project is a good idea, you might be interested in this blogpost (and several previous ones I wrote).

Originally, my plan was to directly push metadata to most KDE projects. The problem is that there is no way to reach all maintainers and let them opt out of getting metadata pushed to their repositories. There is also no technical policy for KDE projects, since “KDE” is really only about the community right now, and there are no technical criteria a project under the KDE umbrella has to fulfill (at least to my knowledge – in theory, even GTK+ projects are perfectly fine within KDE).

Since I feel very uncomfortable in touching other people’s repositories without sending them a note first, I think the best way forward is an opt-in approach.

So, if you want your KDE project to ship metadata, follow these simple steps:

1. Check if there already is metadata for your project

That’s right – we already have some metadata available. Check out the kde-appstream-metadata-templates repository at Github. You can take the XML file from there, if you want. Just make sure that there are no invalid tags in the description field (no <a/> nodes allowed, for example – the content is not HTML!), check that you have an SPDX-compliant <project_license/> tag, check that the public interfaces listed in the <provides/> tag match your project, and also test that the URLs work.

Then you can copy the modified AppStream metadata to your project.
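For illustration, a cleaned-up metadata file might look roughly like this – every id, name, license and URL below is a placeholder rather than content from the actual templates:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative placeholder values only. -->
<component type="desktop">
  <id>org.kde.myapp.desktop</id>
  <project_license>GPL-2.0+</project_license>
  <name>MyApp</name>
  <summary>Short one-line summary of MyApp</summary>
  <description>
    <p>Plain description paragraphs only; no HTML nodes such as links.</p>
  </description>
  <url type="homepage">http://myapp.example.org</url>
  <provides>
    <binary>myapp</binary>
  </provides>
</component>
```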

2. Write new metadata

How to write new metadata is described in detail at this TechBase Wiki page. Just follow the instructions.

In case you need help or want me to push the metadata to your project if you don’t have the time, you can also write me an email: matthias [{AT}] tenstral . net – or alternatively file a bug against the Github project linked above.

Don’t forget to have CMake install your shiny new metadata into /usr/share/appdata/.
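A minimal sketch of that install rule – the file name is a placeholder for your project's actual metadata file:

```cmake
# Placeholder file name; use your project's real AppStream metadata file.
install(FILES myapp.appdata.xml DESTINATION share/appdata)
```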

All metadata you add to your project will automatically get translated by the KDE l10n scripts, no further action is required. So far, projects like Ark, K3b and Calligra are shipping metadata, and the number of AppStream-aware projects in KDE is growing constantly, which greatly improves their visibility in software centers, and will help distributions a lot in organizing installed software.

If you have further questions, please ask! :-)

Since I started maintaining the drm/i915 driver over a year ago I've fine-tuned the branching model a lot. So I think it's time for an update, especially since Sedat Dilek poked me about it.

There are two main branches I merge patches into, namely drm-intel-fixes and drm-intel-next-queued. drm-intel-fixes contains bugfixes for the current -rc upstream kernels, whereas drm-intel-next-queued (commonly shortened to dinq) is for feature work. The first special thing is that, contrary to other subsystem trees, I don't close dinq while the upstream merge window is open - stalling QA and our own driver feature and review work for usually more than 2 weeks is a bit too disruptive. But the linux-next tree maintained upstream forbids this, which is why I also have the for-linux-next branch: usually it's a copy of the drm-intel-next-queued branch, except when the merge window is open, when it's a copy of the drm-intel-fixes branch. This way we can keep the process in our team going without upsetting Stephen Rothwell.
Usually once the merge window opens I have a mix of feature work (which then obviously missed the deadline) and bugfixes outstanding (i.e. not yet merged to Dave Airlie's drm branches). Therefore I split out the bugfixes from dinq into -fixes and rebase the remaining feature work in dinq on top of that.

Now for our own nightly QA testing we want a lightweight linux-next just for graphics. Hence there's the drm-intel-nightly branch which currently merges together drm-intel-fixes and -next-queued plus the drm-next and drm-fixes branches from Dave Airlie. We've had some ugly regressions in the past due to changes in the drm core which we've failed to detect for months, hence why we now also include the drm branches.

Unfortunately not all of our QA procedures are fully automated and can be run every night, so roughly every two weeks I try to stabilize dinq a bit and freeze it as drm-intel-next. I also push out a snapshot of -nightly to drm-intel-testing, so that QA has a tree with all the bugfixes included to do the extensive and labour-intensive manual tests. Once that's done (it usually takes about a week) and nothing bad has popped up, I send a pull request to Dave Airlie for that drm-intel-next branch.

There's also still the for-airlied branch around, but since I now use git tags both for -fixes and -next pull requests I'll probably remove that branch soon.
Finally there's the rerere-cache branch. I have a bunch of scripts which automatically create drm-intel-nightly for me, using git rerere and a set of optional fixup patches. To be able to seamlessly do both at home with my workstation and on the road with my laptop the scripts also push out the latest git rerere data and fixup branches to this branch. When I switch machines I can then quickly sync up all the maintainer branch state to the next machine with a simple script.

Now if you're a developer and wonder which branch you should base your patches on, it's pretty simple: if it's a bugfix which should be merged into the current -rc series, it should be based on top of drm-intel-fixes. Feature work on the other hand should be based on drm-intel-nightly - if the patches don't apply cleanly to plain dinq I'll look at the situation and decide whether an explicit backport or just resolving the conflicts is the right approach. Upstream maintainers don't like backmerges, hence I try to avoid them if it makes sense. But that also means that dinq and -fixes can diverge quite a bit and that some of the patches in -fixes would break things on plain dinq in interesting ways. Hence the recommendation to base feature development on top of -nightly.
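As a sketch of starting a feature branch, here is the idea with a throwaway local repository standing in for the real drm-intel remote - only the branch names come from above, everything else is scaffolding for the example:

```shell
set -e
# Toy stand-in for the drm-intel repository: in real life these branches come
# from the drm-intel remote, not from local "git branch" calls.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name "You"
git commit -q --allow-empty -m "baseline"
git branch drm-intel-fixes      # bugfixes for the current -rc go on top of this
git branch drm-intel-nightly    # feature work goes on top of this
# Starting a feature branch based on -nightly:
git checkout -q -b my-feature drm-intel-nightly
git rev-parse --abbrev-ref HEAD
```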

Addendum: All these branches can be found in the drm-intel git repository.

Update: The drm-intel repository moved, so I've updated the link.
May 23, 2014
So I've created a copr

http://copr.fedoraproject.org/coprs/airlied/mst/

With a kernel + intel driver that should provide support for DisplayPort MST on Intel Haswell hardware. It doesn't do any of the fancy Dell monitor stuff yet; it's primarily for people who have Lenovo or Dell docks and laptops that currently can't do multihead.

The kernel source is from this branch which backports a chunk of stuff to v3.14 to support this.

http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-i915-mst-v3.14

It might still have some bugs and crashes, but the basics should in theory work.
May 19, 2014

I've been hacking on a little tool for the last couple of days and I think it's ready for others to look at it and provide suggestions to improve it. Or possibly even tell me that it already exists, in which case I'll save a lot of time. "tellme" is a simple tool that uses text-to-speech to let me know when a command finished. This is useful for commands that run for a couple of minutes - you can go off and read something and the computer tells you when it's done, instead of you polling every couple of seconds to check. A simple example:


tellme sudo yum update
runs yum update, and eventually says in a beautiful totally-not-computer-sounding voice "finished yum update successfully".
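The core idea fits in a few lines of shell. This is just a sketch of the concept, not the actual tool, and the espeak text-to-speech command is an assumption - swap in whatever TTS tool you prefer:

```shell
# Sketch of a tellme-like wrapper (not the real tool); "espeak" is an
# assumption, substitute any text-to-speech command.
tellme() {
    if "$@"; then
        msg="finished $1 successfully"
    else
        msg="$1 failed"
    fi
    echo "$msg"
    command -v espeak >/dev/null 2>&1 && espeak "$msg" >/dev/null 2>&1
    return 0
}

tellme true   # prints (and says) "finished true successfully"
```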

That was the first incarnation, which was a shell script. I've since started putting a few more features in (it's now in Python) and it supports per-command configuration and a couple of other semi-smart things. For example:


whot@yabbi:~/xorg/xserver/Xi> tellme make
eventually says "finished xserver make successfully". With the default make configuration, it runs up the tree to search for a .git directory and then uses that as basename for the voice output. Which is useful when you rebuild all drivers simultaneously and the box tells you which ones finished and whether there was an error.

I put it up on github: https://github.com/whot/tellme. It's still quite rough, but workable. Have a play with it and feel free to send me suggestions.

May 18, 2014

Hello,

The Google Summer of Code 2014 coding period starts tomorrow. This year, my project is to expose NVIDIA’s GPU graphics counters to userspace through mesa. This idea follows my previous Google Summer of Code, which was mainly focused on reverse engineering NVIDIA’s performance counters.

The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications. At the end of this GSoC, NVIDIA’s GPU graphics counters for GeForce 8, 9 and 2XX (nv50/tesla) will (almost all) be exposed for Nouveau. Some counters won’t be available until the compute support (ie. the ability to launch kernels) for nv50 is implemented.

During the past weeks, I have continued reverse engineering NVIDIA’s graphics counters for nv50. Currently, the documentation is almost complete (except for aa, ac and af because I don’t have them), and recently I started this process for nvc0 cards. At the moment this documentation hasn’t been pushed to envytools and it is only available in my personal repository.

For checking the reverse engineered configuration of the performance counters, I developed a modified version of OGLPerfHarness (the OpenGL sample code of NVPerfKit). This OpenGL sample automatically monitors and exports values of performance counters by using NVPerfSDK on Windows. The figure below shows an example.

openglharness-screenshot

This tool is called (using a bash script) for all available counters and it produces the following output (for the shader_busy signal in this example):

OPTIONS:
model=bunny
model-count=27
render-mode=vbo
texture=small
num-frames=100
fullscreen=no
STATS:
fps=9.53
mean=98.5%
min=98.5%
max=98.6%

All stats produced by the OpenGL sample are available in my repo. However, I didn’t publish the code because I don’t have the right to redistribute it, but I can send a patch if anyone is interested.

For checking the configuration of these performance counters on Nouveau, I ported my tool to Linux. Then, I was able to compare the values exported from Windows with those monitored on Nouveau using nv_perfmon.

Now, the plan for the next weeks is to work on the kernel ioctls interface.

See you later!


May 17, 2014
I've just pushed to upstream mesa support for occlusion query, which means that freedreno now advertises OpenGL 2.0:


OpenGL vendor string: freedreno
OpenGL renderer string: Gallium 0.4 on FD320
OpenGL version string: 2.0 Mesa 10.3.0-devel (git-00fcf8b)
OpenGL shading language version string: 1.20

Note that this is desktop OpenGL.  Freedreno has supported OpenGLES 2.0 for quite a long time now.

Implementing occlusion query was a bit interesting due to the way the tiling works on adreno.  We have to track query results per tile.  I've written up a bit of a description about how it works on the wiki: Hardware Queries

Looks like next up is sRGB support which gets us up to GL 2.1.  And then the fun begins with work on GL/GLES 3.0 :-)

EDIT: turns out sRGB texture support is pretty easy.  So now we are GL 2.1.  (GL/GLES 3.0 also needs sRGB render target support which is a bit more involved.  But there that is just one of several features needed for 3.0).


May 15, 2014

Title

Expose NVIDIA’s GPU graphics counters to the userspace.

Short description

This project aims to expose NVIDIA’s GPU graphics counters to userspace through mesa. This idea follows my previous Google Summer of Code, which was mainly focused on reverse engineering NVIDIA’s performance counters. The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications.

Personal information

I’m a student in my final year of an MSc degree at the University of Bordeaux,
France. I already participated in the Google Summer of Code last year [1] and
my project was to reverse engineer NVIDIA’s performance counters.

Context

Performance counters

A hardware performance counter is a set of special registers which are used
to store counts of hardware-related activities. Hardware counters are
often used by developers to identify bottlenecks in their applications.

In this proposal, we are only focusing on NVIDIA’s performance counters.

There are two types of counters offered by NVIDIA which provide data directly
from various points of the GPU. Compute counters are used for OpenCL, while
graphics counters give detailed information for OpenGL/Direct3D.

On Windows, compute and graphics counters are both exposed by PerfKit[2], an
advanced software suite (except when it crashes my computer for no particular
reason :-)), which can be used by advanced users for profiling OpenCL and
Direct3D/OpenGL applications.

On Linux, the proprietary driver *only* exposes compute counters, through the
CUDA compute profiler[3] (CUPTI); it does not expose the graphics counters
that PerfKit does, since PerfKit is only available on Windows.

On Nouveau/Linux, some counters are already exposed. Compute counters for
nvc0/Fermi and nve0/Kepler are available in mesa which manages counters’
allocation and monitoring through some software methods provided by the kernel.

The compute and graphics counters distinction made by NVIDIA is arbitrary and
won’t be present in our re-implementation.

Google Summer of Code 2013 review

I took part in the GSoC 2013 and my project was to reverse engineer NVIDIA’s
performance counters and to expose them via nv_perfmon.

Let me now sum up the important tasks I have done during this project.

The first thing I did was to take a look at cupti to understand how GPU
compute counters are implemented on Fermi. After playing a bit with that
profiler, I wrote a tool named cupti_trace[4] to make the reverse engineering
process as automatic as possible. At this stage, I was able to start the
implementation of MP counters on nvc0/Fermi in mesa, based on the previous work
of Christoph Bumiller (aka calim) who already had implemented that support for
nve0/Kepler. To complete this task, I had to implement parts of the compute
runtime for nvc0 (ie. the ability to launch kernels).

MP compute counter support for Fermi:
http://lists.freedesktop.org/archives/mesa-commit/2013-July/044444.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044573.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044574.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044576.html

The second part of my project was to start reverse engineering graphics
counters on nv50/Tesla through PerfKit and gDEBugger[5], an advanced OpenGL and
OpenCL debugger, profiler and memory analyzer. Knowing that PerfKit was only
available on Windows, I was unable to use envytools[6], a tool suite for
reverse engineering the NVIDIA proprietary driver because it depends on
libpciaccess which was not available on Windows. To complete this
task, I then ported this library by using WinIO in order to use tools provided
by envytools like nvapeek and nvawatch.

libpciaccess support on Windows/Cygwin:
http://hakzsam.wordpress.com/2014/01/28/libpciaccess-has-now-official-support-for-windowscygwin/
http://www.phoronix.com/scan.php?page=news_item&px=MTU4NTU
http://cgit.freedesktop.org/xorg/lib/libpciaccess/commit/?id=6bfccc7ec4f0705595385f6684b6849663f781b4

At the end of this Google Summer of Code, some graphics counters had already been
reverse engineered on nv98/Tesla.

This project was successfully completed, except for the implementation of
graphics counters in nv_perfmon and the reverse engineering of MP counters on
Tesla (relative to the schedule). And it has been a very interesting experience
for me, even if it was very hard at the beginning. I’m now able to say that
low-level hardware programming on GPUs is not a trivial task :-).

After GSoC 2013 until now

From October to January, I didn’t work on Nouveau at all because I was
completely busy with university work.

In February, I returned to the reverse engineering of these graphics
counters, and I mostly completed the documentation of the nv50/Tesla chipsets[7].

Project

Benefits to the community

Help Linux developers identify the performance bottlenecks of OpenGL
applications.

Description

Compute counters for nvc0+ are already exposed by Nouveau, but there are still
many performance counters exposed by NVIDIA that remain to be exposed in
Nouveau. Last year, I added support for the compute counters used by OpenCL and
CUDA on nvc0/Fermi.

Graphics counters are currently only available on Windows, but I reverse
engineered them and the documentation is mostly complete. At the moment, nv50,
84, 86, 92, 98, a0, a3 and a5 are documented. In a few days, I should be able to
complete this list by adding the 94, 96 and a8 chipsets. In this GSoC project, I would like to
expose them in Nouveau, but there are some problems between PCOUNTER[8] and MP
counters.

PCOUNTER is the card unit which contains most of the performance counters.
PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a
different source clock and has 255+ input signals that can themselves be the
output of a multiplexer. PCOUNTER uses global counters, whereas MP counters
are per-app and context-switched, like the compute counters used on nvc0+.

Actually, these two types of counters are not really independent and may share
some configuration parts, for example, the output of a signal multiplexer.

Because of the issue of shared configuration of global counters (PCOUNTER)
and local counters (MP counters), I think it’s a bad idea to allow monitoring
multiple applications concurrently. To solve this problem, I suggest, at first,
to use a global lock for allowing only one application at a time and
for simplifying the implementation.

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more
than one application is monitoring performance counters at the same time.

Implementation

kernel interface and ioctls

Some performance counters are global and have to be programmed through MMIO.
They have to be managed by the Linux kernel through an ioctl interface that is
yet to be defined.

mesa

Only mesa should directly use performance counters, because it has all the
information needed to expose them. Mesa is able to allocate and manage MP
counters (per-app) and can also call the kernel in order to program global
counters via the ioctl interface that will be implemented. At this stage, mesa
will be able to expose them in GL_AMD_performance_monitor and nouveau-perfkit.

GL_AMD_performance_monitor

GL_AMD_performance_monitor[9] is an OpenGL extension which can be used to
capture and report performance counters. This is a great extension for Linux
developers, though it currently does not report any performance counters on
NVIDIA GPUs. After the core implementation lands in mesa, this task should
not be too hard, since I already have a branch[7] of mesa with core support for
GL_AMD_performance_monitor. Thanks to Kenneth Graunke and Christoph Bumiller.

nouveau-perfkit

Nouveau-perfkit will be a Linux/Nouveau version of NVPerfKit. This tool will be based
on mesa’s implementation. nouveau-perfkit will export both GPU graphics
counters (only nv50/Tesla at first) and compute counters (nvc0+). To
maintain interoperability with NVIDIA, I am thinking about re-using the
interface of NVIDIA’s NVPerfKit. This tool will be for nouveau only.

GSoC work

Required tasks:
- core implementation (kernel interface + ioctls + mesa)
- expose graphics counters through GL_AMD_performance_monitor
- add nouveau-perfkit, a Linux version of NVPerfKit

Optional tasks (if I have the time):
- reverse engineering NVIDIA’s GPU graphics counters for Fermi and Kepler
- all the work which can be done around performance counters

Approximate schedule

(now until 19 May)
- complete the documentation of signals on nv50/tesla
- write OpenGL samples code to test these graphics counters
- test the reverse engineering on Nouveau (mostly done) and write piglit tests
- think more about the core implementation

(19 May until 18 July)
- core implementation of GPU graphics counters
(kernel interface + ioctls + mesa)

(18 July to 28 July)
- expose graphics counters through GL_AMD_performance_monitor

(28 July to 18 August)
- implement nouveau-perfkit based on mesa, following the NVPerfKit interface

(after GSoC)
- As last year, I’ll continue to work on Nouveau after the end of this
Google Summer of Code 2014 because I like this job, it’s fun :-).

Thank you for reading. Have a good day.

References

[1] http://hakzsam.wordpress.com/2013/05/27/google-summer-of-code-2013-proposal-for-x-org-foundation/
[2] https://developer.nvidia.com/nvidia-perfkit
[3] http://docs.nvidia.com/cuda/cupti/index.html
[4] https://github.com/hakzsam/re-pcounter-tools/tree/master/src
[5] http://www.gremedy.com/
[6] https://github.com/envytools/envytools
[7] https://github.com/hakzsam/re-pcounter-tools/tree/master/hwdocs/pcounter
[8] https://github.com/envytools/envytools/blob/master/hwdocs/pcounter/intro.rst
[9] https://www.opengl.org/registry/specs/AMD/performance_monitor.txt


May 12, 2014

Two weeks ago I flew to Berlin for the 2nd GNOME Developer Experience hackfest.

The event was quite productive and allowed us to move things forward on different fronts. There have been many blog posts about most of the goings-on, so I'll stick to what I did.

I joined the discussions about the data model APIs with Ryan and Lars and tried (and failed) to include some new widgets in Glade (GtkStack). Turns out it's not as simple as I thought, and there was already a patch in the works for this, so I gave up on that one.

Later on I sat down with Tomeu to try to revive my former attempts to revamp the API reference documentation UI a bit. He managed to produce a set of JSON files I could use to build the UI from. My intention is to create a web frontend that works mostly client-side and is off-line friendly (prefetching all the content for later use). This could potentially enable a DevHelp replacement through a Web (Epiphany) App.

I have some code working using jQuery; I would like to eventually switch to AngularJS as it has built-in templating support. Currently I am generating quite a bit of HTML myself, and that would eventually become a maintenance/styling nightmare.

May 07, 2014
Tomorrow I'll be travelling to LinuxTag in Berlin for the first time. Should be pretty cool, and to top it off I'll give a presentation about the state of the intel kernel graphics driver. For those that can't attend I've uploaded the slides already, and if there's a video cut I'll link to that as soon as it's available.

As promised, today I would like to write a bit about the making of The Hacker's Guide to Python. It has been a very interesting experimentation, and I think it is worth sharing it with you.

The inspiration

All started out at the beginning of August 2013. I was spending my summer, as the rest of the year, hacking on OpenStack.

As years passed, I got more and more deeply involved in the various tools that we either built or contributed to within the OpenStack community. And I somehow got the feeling that my experience with Python, the way we used it inside OpenStack and other applications during these last years was worth sharing. Worth writing something bigger than a few blog posts.

The OpenStack project is doing code reviews, and therefore so did I for almost two years. That inspired a lot of topics, like the definitive guide to method decorators that I wrote at the time I started the hacker's guide. Stumbling upon the same mistakes or misunderstanding over and over is, somehow, inspiring.

I also stumbled upon Nathan Barry's blog and his book Authority, which were very helpful for getting started and gave me some sort of guideline.

All of that brought me enough ideas to start writing a book about Python software development for people already familiar with the language.

The writing

The first thing I started to do was to list all the topics I wanted to write about. The list turned out to have subjects that had no direct interest for a practical guide. For example, on one hand, very few developers know in detail how metaclasses work, but on the other hand, I never had to write a metaclass during these last years. That's the kind of subject I decided not to write about; I dropped all subjects that I felt were not going to help my reader be more productive, even if they could be technically interesting.

Then, I gathered all problems I saw during the code reviews I did during these last two years. Some of them I only recalled in the days following the beginning of that project. But I kept adding them to the table of contents, reorganizing stuff as needed.

After a couple of weeks, I had a pretty good overview of the content that I was going to write about. All I had to do was fill in the blanks (that sounds so simple now).

The entire writing of the book took a hundred hours, spread from August to November, during my spare time. I had to stop all my other side projects for that.

The interviews

While writing the book, I tried to parallelize everything I could. That included asking people for interviews to be included in the book. I already had a pretty good list of the people I wanted to feature in the book, so I took some time as soon as possible to ask them, and sent them detailed questions.

I discovered two categories of interviewees. Some of them were very fast to answer (≤ 1 week), and others were much, much slower. A couple of them even set up Git repositories to answer the questions, because it probably looked like an entire project to them. :-) So I had to keep things in sight, kindly ask from time to time whether everything was alright, and at some point start setting deadlines.

In the end, the quality of the answers was awesome, and I like to think that was because I picked the right people!

The proof-reading

Once the book was finished, I somehow needed to have people proof-reading it. This was probably the hardest part of this experiment. I needed two different types of reviews: technical reviews, to check that the content was correct and interesting, and language review. That one is even more important since English is not my native language.

Finding technical reviewers seemed easy at first, as I had a ton of contacts that I identified as being able to review the book. I started by asking a few people if they would be comfortable reading a single chapter and giving me feedback. I started doing that in September: having the writing and the reviews done in parallel was important to me in order to minimize latency and the book's release delay.

All the people I contacted answered positively that they would be interested in doing a technical review of a chapter. So I started to send chapters to them. But in the end, only 20% replied. And even after that, a large portion stopped reviewing after a couple of chapters.

Don't get me wrong: you can't be mad at people for not wanting to spend their spare time on book editing the way you do.

However, from the few people that gave their time to review a few chapters, I got tremendous feedback at all levels. That was very important and helped a lot in building confidence. Writing a book alone for months, without anyone looking over your shoulder, can make you doubt that you are creating something worthwhile.

As for English proof-reading, I went ahead and used oDesk to recruit a professional proof-reader. I looked for people with the right skills: a good level of English (being a native English speaker at least), the ability to understand what the book was about, and the ability to work within reasonable deadlines. I had mixed results from the people I hired, but I guess that's normal. The only error I made was not parallelizing those reviews enough, so I probably lost a couple of months on that.

The toolchain

While writing the book, I took a few breaks to build a toolchain. What I call a toolchain is the set of tools used to render the final PDF, EPUB and MOBI files of the guide.

After some research, I decided to settle on AsciiDoc, using the DocBook output, which is then transformed to LaTeX and then to PDF, or to EPUB directly. I rely on Calibre to convert the EPUB file to MOBI. It took me a few hours to do what I wanted, using some magic LaTeX tricks to get a proper rendering, but it was worth it and I'm particularly happy with the result.

For the cover design, I asked my talented friend Nicolas to do something for me, and he designed the wonderful cover and its little snake!

The publishing

Publishing is an interesting topic people kept asking me about. This is what I had to answer a few dozen times:

  • "Who is your editor?"
  • "Me."

I never had any plan for asking an editor to publish this book. Nowadays, asking an editor to publish a book feels to me like asking a major company to publish a CD. It feels awkward.

However, don't get me wrong: there can be a few upsides to having an editor. They will find reviewers and review your book for you. Having the book review handled for you is probably a very good thing, considering how hard it was for me to get that in place. It can be especially important for a technical book.

Also, your book may end up in brick-and-mortar stores and be part of a collection, both improving visibility. That may improve your book's sales, though the editor and all the intermediaries are going to keep the largest share of the money anyway.

  • "Oh, you will publish it yourself, great. So you will print it and sell it to people?"
  • "Not really."

I've heard good stories about people using Gumroad to sell electronic contents, so after looking for competitors in that market, I picked them. I also had the idea to sell the book with Bitcoins, so I settled on Coinbase, because they have a nice API to do that.

Setting up everything was quite straight-forward, especially with Gumroad. It only took me a few hours to do so. Writing the Coinbase application took a few hours too.

  • "Oh, you will sell it only as an ebook? That's too bad. You need a paper version. Many people will want a paper version."

My initial plan was to only sell online an electronic version. On the other hand, since I kept hearing that a printed version should exist, I decided to give it a try. I chose to work with Lulu because I knew people using it, and it was pretty simple to set up.

The launch

Once I had everything ready, I built the selling page and connected everything between Mailchimp, Gumroad, Coinbase, Google Analytics, etc.

Writing the launch email was really exciting. I used a Mailchimp feature to send the launch mail in several batches, just to have some margin in case of a sudden last-minute problem. But everything went fine. Hurrah!

I distributed around 200 copies of the ebook in the first 48 hours, for about $5000. That covered all the costs I had from writing the book, and even more, so I was already pretty happy with the launch.

Retrospective

In retrospect, something that I didn't do as well as possible was to build a solid mailing list of interested people, and to build up anticipation and an incentive to buy the book at launch. My mailing list counted around 1500 people who subscribed because they were interested in the launch of the book; in the end, probably only 10-15% of them bought the book during the launch, which is probably a bit lower than what I could expect.

But more than a month later, I have distributed in total almost 500 copies of the book (including physical units) for more than $10000, so I tend to think that this was a success. I still sell a few copies of the book each week, but the numbers are small compared to the launch.

I sold less than 10 copies of the ebook using Bitcoins, and I admit I'm a bit disappointed and surprised about that.

Physical copies represent 10% of the book's distribution. That's probably a lot lower than most people who pushed me to do it thought it would be, but it is still higher than what I thought it would be. So I would still advise having a paperback version of your book, at least because it's nice to have it in your library.

I only got positive feedback, a few typo reports, and absolutely no refund requests, which I find really amazing.

The good news is also that I've been contacted by a couple of Korean and Chinese publishers about getting the book translated and published in those countries. If everything goes well, the book should be translated in the upcoming months and be available in those markets in 2015!

If you didn't get a copy, there's still time to do so!

May 03, 2014

It took some time, but now it’s finally done: KDE has translations for AppStream upstream metadata!

AppStream is a Freedesktop project to extend the metadata about software projects that is available in distributions, especially regarding applications. Distributions compile a metadata file from data collected from packages, .desktop files and possibly other information sources, and create an AppStream XML file from it, which is then – directly or via a Xapian cache – read by software-center-like applications such as GNOME Software or KDE’s Apper.

Since the metadata available from current sources is not standardized and rather poor, upstream projects can ship small XML files: AppStream upstream metadata, or AppData for short. These files contain additional information about a project, such as a long description and links to screenshots. They also provide hints about the public interfaces a piece of software provides, for example binaries and libraries, making it possible for distributors to give users exactly the right package name in case they are missing a software component.

So, in order to represent graphical KDE applications as they deserve in the new software centers making use of AppStream, we need to ship AppData files, with long descriptions, screenshots and a few URLs.

But how can you create these metadata files? In case you want your graphical KDE app to ship an AppData file, there is now a help page on the Techbase Wiki which provides all information needed to get started!

For non-visual stuff or software which just wants to publish its provided interfaces with AppStream metadata, there is a dedicated page for that as well. Shipping metadata for non-GUI apps will help programmers satisfy dependencies in order to compile new software, enhance bash-completion for missing binaries, and provide some other neat stuff (take a look at this blogpost to get a taste of it).

And if you want to read a FAQ about the metadata stuff and get the bigger picture, just go to the Techbase Wiki page about AppStream metadata as well.

The pages are not 100% final, so if you have questions, please write me a mail and I’ll update the pages, or simply correct/refine them yourself (it’s a wiki after all).

And now to the best thing: as soon as you ship an AppStream upstream metadata file (.appdata.xml for apps or .metainfo.xml for other stuff), the KDE l10n-script (Scripty!) will automatically start translating it, just like we already do with .desktop files. No further action is necessary.

I already have a large number of metadata files here, partially auto-generated, which show that we have 160+ applications in KDE which could get an AppData file, not counting any frameworks or other non-GUI stuff yet. Since that is a bit much to submit via Reviewboard (which I originally planned to do), I hope I can commit the changes directly to the respective repositories, where the maintainers can take a look at them and adjust them to their liking. If that idea does not receive approval, I will just publish a set of data somewhere for the KDE app maintainers to take as a reference (the auto-generated files need some fixup to be commit-ready, which I’d do if I can commit the changes directly). Either way, it is safe now to write and ship AppData files in KDE projects!

In order to get your stuff translated, it is necessary that you follow the AppStream 0.6 metadata specification, and not one of the older revisions. You can easily detect 0.6 metadata by the <component> root node, instead of <application>, or by it having a metadata_license tag. We don’t support the older versions simply because it’s not necessary: there were only two KDE projects shipping AppData before, and they are now using 0.6 data as well. Since 0.6, the metadata XML format is guaranteed to be stable, and the only reason which could make me change it in an incompatible way is to prevent something as bad as the end of the world from happening (== won’t happen ;-) ). You can find the full specification (upstream and distro data) here.
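To make the format concrete, here is a minimal 0.6-style AppData file; the application id, names, description and URLs are illustrative placeholders, and the specification linked above documents the full tag set:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 0.6-style metadata: <component> root node and a metadata_license tag -->
<component type="desktop">
  <id>foobar.desktop</id>
  <metadata_license>CC0-1.0</metadata_license>
  <name>FooBar</name>
  <summary>An example KDE application</summary>
  <description>
    <p>A longer description of what FooBar does, shown in software centers.</p>
  </description>
  <url type="homepage">http://example.org/foobar</url>
  <screenshots>
    <screenshot type="default">
      <image type="source">http://example.org/foobar/screenshot.png</image>
    </screenshot>
  </screenshots>
</component>
```

Installed next to your sources as foobar.appdata.xml, a file like this is what Scripty would pick up for translation.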

All parsers are able to handle 0.6 data now, and the existing tools are almost all migrated already (might take a few months to hit the distributions though).

So, happy metadata-writing! :-)

Thanks to all people who helped with making this happen, and especially Burkhard Lück and Albert Astals Cid for their patch-review and help with integrating the necessary changes into the KDE l10n-script.

May 01, 2014

I’m at the GNOME Development Experience Hackfest in Berlin, and one of the things that I wanted to target during these days was to keep on looking at how we can enable different profiles in Devhelp.


As you probably know, Devhelp will show you the documentation of libraries installed in your system (usually only if you have the -devel or -docs package of the library installed). While this is already enough for most users, there is also the case where a developer wants to target a different version (older or newer) of the library than the one installed in the system.

A typical case for this is developing applications using GNOME’s jhbuild infrastructure, targeting either a given GNOME release or the git master of the involved modules. In this case, if you want to use new methods of, let’s say, GTK+, you usually end up needing to fire up a web browser and look for the latest GTK+ documentation either on developer.gnome.org or in your jhbuild’s ${prefix}/share/gtk-doc/html directory.

In order to avoid this, I’m prototyping some ideas to let the users switch between different profiles, e.g.:

  • The ‘local’ profile, which is equivalent to what Devhelp currently shows.
  • A user-defined ‘jhbuild’ profile, which could point to the install prefix of the jhbuild setup.
  • Other user-defined profiles, which could point to other prefixes where the user has installed the newer (or older) libraries and their documentation.
  • Profiles for each new GNOME release, e.g. 3.12, which could get downloaded from developer.gnome.org as a tarball containing all documentation for a given release.

The most challenging case is probably the last one, given that it would require some extra work in the website in order to make sure the documentation tarball is generated and published in every new release, plus of course client-side management of these downloaded profiles in Devhelp.

For now this is just a basic set of ideas, the final result may or may not be similar; we’re of course open to suggestions!



April 30, 2014
10 years ago, I committed the first version of a browser plugin in Totem's source code tree. Today, it's going away.

The landscape of video on the Web changed, then changed back again, and web technologies have moved on. We've witnessed:

  • The fall of RealPlayer
  • The rise of Flash video players, as a way to turn videos into black boxes with minimal "copy protection" (cf. "YouTube downloader" in your favourite search engine)
  • The rise and precipitous fall of Silverlight (with only a handful of websites, ever, or still, using it)
  • And most importantly, the advent of HTML5's <video> tag
Totem's browser plugin did as good a job as it could of mimicking legacy web browser plugins from other platforms, such as QuickTime or Windows Media Player (even if we eventually stopped caring about mimicking RealPlayer).

It wasn't helped by the ill-defined Netscape Plugin APIs (NPAPI), which meant that we never knew whether we'd receive a stream for the video we were about to play, or no stream at all; whether, when we requested one, we'd get both an automatic stream and the one we requested; or whether it would download empty files. Nor could we tell the browser to open a file in another application when the user clicked directly on it. All in all, pretty dire.

We made attempts at replacing the Flash plugin for playing back videos, but the NPAPI meant that we needed to handle everything or nothing. Ideally, we'd have been able to tell the browser to use our browser plugin for websites that we could support through libquvi, and either fall back to a placeholder or the real Flash plugin for other cases. NPAPI didn't allow us to do that.

Given the current state of media playback in browsers on Linux, and the facts that Totem's browser plugin will not work on Wayland (it uses XEmbed to slot into the browser UI), that its UI has been pretty broken since the redesign of the main player (not unfixable, but time-consuming), and that it does not work properly in GNOME's own web browser (due to bad interactions between Clutter and GL acceleration in WebKit), I think it's time to call it a day.

Good bye Totem browser plugin.

I'll miss the clever puns of your compatibility plugins (Real Player/Complex and QuickTime/NarrowSpace being the best ones). I won't miss interacting with ill-defined APIs and buggy implementations.

I've just updated the post about X.Org synaptics support for the Lenovo T440, T540, X240, Helix, Yoga, X1 Carbon. For those following my blog, here is a rough diff of the updates:

  • All touchpads in this series need a kernel quirk to fix the min/max ranges. It's like a happy meal toy: make sure you collect them all.
  • A new kernel evdev input property, INPUT_PROP_TOPBUTTONPAD, is available in 3.15. It marks the devices that require top software buttons. It will be backported to stable.
  • A new option, HasSecondarySoftButtons, was added to the synaptics driver. It is automatically set if INPUT_PROP_TOPBUTTONPAD is set; when set, the driver parses the SecondarySoftButtonAreas option and honours the values in it.
  • If you have the kernel min/max fixes and the new property, don't bother with DMI matching. Provide an xorg.conf.d snippet that unconditionally sets SecondarySoftButtonAreas and rely on the driver to parse it when appropriate.
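If you want to check whether your kernel already flags a given touchpad, the device's input property bits are exposed in sysfs as a hex bitmask. A minimal sketch, assuming INPUT_PROP_TOPBUTTONPAD is property number 0x04 as defined in linux/input.h (the event node path is an example; adjust it for your device):

```python
# Check whether an input device advertises INPUT_PROP_TOPBUTTONPAD.
# The sysfs "properties" file contains a hex bitmask of the device's
# input property bits; INPUT_PROP_TOPBUTTONPAD is bit 0x04 in linux/input.h.
INPUT_PROP_TOPBUTTONPAD = 0x04

def has_top_buttonpad(properties_hex):
    """properties_hex: contents of /sys/class/input/eventN/device/properties."""
    mask = int(properties_hex.strip(), 16)
    return bool(mask & (1 << INPUT_PROP_TOPBUTTONPAD))

if __name__ == "__main__":
    # Example path only; substitute the event node of your touchpad.
    with open("/sys/class/input/event4/device/properties") as f:
        print(has_top_buttonpad(f.read()))
```

If this prints True, the synaptics driver will enable the secondary software button area automatically.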

Updates: 30 April 2014, for the new INPUT_PROP_TOPBUTTONPAD property

This is a follow-up to my post from December Lenovo T440 touchpad button configuration. Except this time the support is real, or at least close to being finished. Since I am now seeing more and more hacks to get around all this I figured it's time for some info from the horse's mouth.

[update] I forgot to mention: synaptics 1.8 will have all of this; the first snapshot is available here

Lenovo's newest series of laptops have a rather unusual touchpad. The trackstick does not have a set of physical buttons anymore. Instead, the top part of the touchpad serves as software-emulated buttons. In addition, the usual ClickPad-style software buttons are to be emulated on the bottom edge of the touchpad. An ASCII-art of that would look like this:


+----------------------------+
| LLLLLLLLLL MMMMM RRRRRRRRR |
|                            |
|                            |
|                            |
|                            |
|                            |
|                            |
|  LLLLLLLL        RRRRRRRR  |
+----------------------------+
Getting this to work required a fair bit of effort: patches to synaptics, the X server and the kernel, and a fair bit of trial and error. Kudos for getting all this sorted goes to Hans de Goede, Benjamin Tissoires, Chandler Paul and Matthew Garrett. And in the process of fixing this we also fixed a bunch of other issues that have been plaguing clickpads for a while.

The first piece in the puzzle was to add a second software button area to the synaptics driver. Option "SecondarySoftButtonAreas" now allows a configuration in the same manner as the existing one (i.e. right and middle button). Any click in that software button area won't move the cursor, so the buttons will behave just like physical buttons. Option "HasSecondarySoftButtons" defines if that button area is to be used. Of course, we expect that button area to work out of the box, so we now ship configuration files that detect the touchpad and apply it automatically. Update 30 Apr: Originally we tried to get this done based on PNPID or DMI matching, but a better solution is the new INPUT_PROP_TOPBUTTONPAD evdev property bit. This is now applied to all these touchpads, and the synaptics driver uses it to enable the secondary software button area. This bit will be available in kernel 3.15, with stable backports happening after that.

The second piece in the puzzle was to work around the touchpad firmware. The touchpads speak two protocols, RMI4 over SMBus and PS/2. Windows uses RMI4; Linux still uses PS/2. Apparently the firmware never got tested for PS/2, so the touchpad gives us bogus data for its axis ranges. A kernel fix for this is in the pipe. Update 30 Apr: every single touchpad of this generation needs a fix; they have been or are being merged.

Finally, the touchpad needed to be actually usable, so a bunch of patches that tweak the clickpad behaviours were merged. If a finger is set down inside a software button area, finger movement no longer affects the cursor. This stops the ever-so-slight but annoying movements when you execute a physical click on the touchpad. Also, there is a short timeout after a click to avoid cursor movement when the user just presses and releases the button. The timeout is short enough that if you do a click-and-hold for drag-and-drop, the cursor will move as expected. If a touch started outside a software button area, we can now use the whole touchpad for movement. And finally, a few fixes to avoid erroneous click events - we'd sometimes get the software button wrong if the event sequence was off.

Another change altered the behaviour of the touchpad when it is disabled through the "Synaptics Off" property. If you use syndaemon to disable the touchpad while typing, the buttons now work even when the touchpad is disabled. If you don't like touchpads at all and prefer to use the trackstick only, use Option "TouchpadOff" "1". This will disable everything but physical clicks on the touchpad.

On that note I'd also like to mention another touchpad bug that was fixed in the recent weeks: plenty of users reported synaptics having a finger stuck after suspend/resume or sometimes even after logging in. This was an elusive bug and finally tracked down to a mishandling of SYN_DROPPED events in synaptics 1.7 and libevdev. I won't provide a fix for synaptics 1.7 but we've fixed libevdev - please use synaptics 1.8 RC1 or later and libevdev 1.1 RC1 or later.

Update 30 Apr: If the INPUT_PROP_TOPBUTTONPAD is not available on your kernel, you can use DMI matching through udev rules. PNPID matching requires a new kernel patch as well, at which point you might as well rely on the INPUT_PROP_TOPBUTTONPAD property. An example for udev rules that we used in Fedora is below:


ATTR{[dmi/id]product_version}=="*T540*", ENV{ID_INPUT.tags}="top_softwarebutton_area"
and with the matching xorg.conf snippet:

Section "InputClass"
Identifier "Lenovo T540 trackstick software buttons"
MatchTag "top_softwarebutton_area"
Option "HasSecondarySoftButtons" "on"
# If you don't have the kernel patches for your touchpad
# to fix the min/max ranges, you need to use absolute coordinates
# Option "SecondarySoftButtonAreas" "3363 0 0 2280 2717 3362 0 2280"
Option "SecondarySoftButtonAreas" "58% 0 0 8% 42% 58% 0 8%"
EndSection
Update 30 Apr: For those touchpads that already have the kernel fix to adjust the min/max range, simply specifying the buttons in % of the touchpad dimensions is sufficient. For all other touchpads, you'll need to use absolute coordinates.
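The percent values are just the absolute coordinates scaled to the axis range; a small sketch of that conversion (the axis numbers below are made up for illustration, not real T540 values):

```python
def to_percent(coord, axis_min, axis_max):
    """Convert an absolute touchpad coordinate into the percent notation
    accepted by SecondarySoftButtonAreas (percent of the axis dimension)."""
    return 100.0 * (coord - axis_min) / (axis_max - axis_min)

# Illustrative only: on a hypothetical 0..2300 y axis, y=184 sits at the
# 8% mark, i.e. the "8%" bottom edge of the top software button area.
print(round(to_percent(184, 0, 2300)))  # 8
```

You can query your touchpad's real min/max ranges with evtest or xinput and feed them through the same formula.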

Fedora users: everything is being built in rawhide and (Update 30 Apr) in F20 and F19. The COPR listed in an earlier version of this post is not available anymore.

April 29, 2014

When Solaris 11.1 came out in October 2012, I posted about the changes to the included FOSS packages. With the publication today of Solaris 11.2 beta, I thought it would be nice to revisit this and see what’s changed in the past year and a half. This time around, I’m including some bundled packages that aren’t necessarily covered by a free software or open source license, but are of interest to Solaris users.

Removing software in updates

Last time I discussed how IPS allowed us to make a variety of changes in update releases much more easily than in the Solaris 10 package system. One of these changes is obsoleting packages, and we’ve done that in a couple of rare cases in both Solaris 11.1 and 11.2 where the software was abandoned by the upstream and we decided it would be worse to keep it around, potentially broken, than to remove it on upgrade.

When we do this, notices will be posted to the End of Features for Solaris 11 web page, alongside the list of features that have been declared deprecated and may be removed in future releases. As you can see there, in Solaris 11.1 the Adobe Flash Player and tavor HCA driver packages were removed.

In Solaris 11.2, three more packages have been removed. slocate was a “secure” version of the locate utility, which wouldn’t show a user any files that they didn’t have permission to access. Unfortunately, this utility was broken by changes in the AST library, and since there is no longer an upstream for it, we decided to follow the lead of several Linux distros and moved to mlocate instead, which is added in this release.

The other two removed packages are both Xorg video drivers - the nv driver for NVIDIA graphics, and the trident driver for old Trident graphics chipsets. Most users will not notice these removals, but if you had manually created an xorg.conf file specifying one of these drivers, you may need to edit it to use the vesa driver instead.

NVIDIA had previously supported the nv open source driver and contributed updates to X.Org to support new chipsets in it, but in 2010, they announced they would no longer do so, and considered nv deprecated, recommending the use of the VESA driver for those who had no better driver to use. While we had continued to ship the nv driver in Solaris, it led to an increasing number of crashes, hangs, and other bugs for which the resolution was to remove the nv driver and use vesa instead, so we are removing it to end those issues. For systems with graphics devices new enough to be supported by the bundled nvidia closed-source driver, this will have no effect. For those with older devices, this will cause Xorg autoconfiguration to load the vesa driver instead, until and unless the user downloads & installs an appropriate NVIDIA legacy driver.

The trident driver was still in Solaris even after we dropped 32-bit support on x86, and years after Trident Microsystems exited the graphics business and sold its graphics product line to XGI, because the Sun Fire V20z server included a Trident chipset for the console video device. Unfortunately, the upstream driver has been basically unmaintained since then, and Oracle has had to apply patches to port it to new Xorg releases. Meanwhile, in order to resolve bugs that caused system hangs, the trident driver was modified to not load on V20z systems, which left us shipping an unmaintained driver solely for a system that could not use it (the V20z uses the vesa driver instead), so we decided to remove it as well.

If you had either of these Xorg driver packages installed, then when you update to 11.2, pkg update will inform you there are release notes for these drivers, to warn you of the possibility that you may need to edit your xorg.conf.

System Management Stack

The popular Puppet system for automating configuration changes across machines has been included in Solaris, updated to support several Solaris features in both the framework and in individual configuration providers. For instance, configuration changes made via Puppet will be recorded in the Solaris audit logs as part of a Puppet session, and Puppet’s configuration file is generated from SMF properties using the new SMF stencil facilities. Providers are included that can configure IPS publishers, SMF properties, ZFS datasets, Solaris boot environments, and a variety of Solaris NIC, VNIC, and VLAN settings.
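As a small illustration of what such a manifest might look like, here is a sketch using Puppet’s core package and zfs types with their Solaris providers; the resource names and values are invented for this example, not taken from the Solaris documentation:

```puppet
# Install a package through the Solaris 11 IPS ("pkg") package provider
package { 'tmux':
  ensure   => installed,
  provider => 'pkg',
}

# Create a ZFS dataset with Puppet's core zfs type
zfs { 'rpool/export/projects':
  ensure     => present,
  mountpoint => '/export/projects',
}
```

With the Solaris integration described above, applying such a manifest would leave a record in the audit logs as well.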

Another addition is the Oracle Hardware Management Pack (HMP), a set of tools that work with the ILOM, firmware, and other components in Sun/Oracle servers to configure low-level system options. Previously these needed to be downloaded and installed separately; now they are a simple pkg install away, and kept up to date with the rest of the OS.

A collaboration with Intel led to the integration of a Solaris port of Intel’s numatop tool for observing memory access locality across CPUs.

From the open source world, we’ve integrated several tools to allow admins and users to do multiple things at once, including the tmux terminal multiplexer, cssh tool for cluster administration via ssh, and GNU Parallel for running commands in parallel.

Developer Stack

For developers, GNU Compiler Collection (gcc) versions 4.7 & 4.8 are added alongside the previous 3.4 & 4.5 packages, and the gcc packages have been refactored to better allow installing different subsets of compilers. Other updated developer tools include Mercurial 2.8.2, GNU emacs 24.3, pylint 0.25.2, and version 7.6 of the GNU debugger, gdb. Newly added tools for developers include GNU indent, JavaScript Lint, and Python’s pep8.

The Java 8 development kit & runtime environment are both available as well. The default installation clusters will only install Java 7, but you can install the Java 8 runtime with “pkg install jre-8” or get both the runtime & development kits with “pkg install jdk-8”. The /usr/java mediated link, through which all the links in /usr/bin for the java, jar, javac, etc. commands flow, will be set by default to the most recent version installed, so installing Java 8 will make that version the default. You can see this via “ls -l /usr/java” reporting:

lrwxrwxrwx   1 root   root     15 Apr 23 14:01 /usr/java -> jdk/jdk1.8.0_05
or via “pkg mediator java” reporting:
MEDIATOR     VER. SRC. VERSION IMPL. SRC. IMPLEMENTATION
java         system    1.8     system
If you want to choose a different version to be default, you can manually set the mediator to that version with “pkg set-mediator -V 1.7 java”. Of course, for many operations, you can directly access any installed java version via the full path, such as /usr/jdk/instances/jdk1.8.0/bin/java instead of relying on the /usr/bin symlinks.

One caveat to be aware of is that Java 8 for Solaris is only provided as 64-bit binaries, as all Solaris 11 and later machines are running 64-bit now. This means that any JNI modules you rely on will need to be compiled as 64-bit and any programs that try to load Java must be 64-bit. There is also no 64-bit version provided of either the Java plugin for web browsers, or the Java Webstart program for starting Java client applications from web pages.

Desktop Stack

Most of the changes in the desktop stack in this release were updates needed to fix security issues, and are mostly covered on the Oracle Third Party Vulnerability Resolution Blog.

There were some feature updates in the X Window System layers of the desktop stack though – most notably the Xorg server was upgraded from 1.12 to 1.14, and the accompanying Mesa library was upgraded to version 9.0.3, which includes support for OpenGL 3.1 and GLSL 1.40 on Intel graphics. The bundled version of NVIDIA’s graphics driver was also updated, to NVIDIA’s latest “long lived branch” - 331. For users with older graphics cards which are no longer supported in this branch, legacy branches are available from NVIDIA’s Unix driver download site.

OpenStack

And last, but certainly not least, especially in the number of packages added to the repository, is the addition of OpenStack support in Solaris. The Cinder Block Storage Service, Glance Image Service, Horizon Dashboard, Keystone Identity Service, Neutron Networking Service, and Nova Compute Service from the OpenStack Grizzly (2013.1) release are all provided, in versions tested and integrated with Solaris features. Between the OpenStack packages themselves and all the Python modules required for them, there are over 100 new FOSS packages in this release.

Detailed list of changes

This table shows most of the changes to the bundled packages between the original Solaris 11.1 release, the latest Solaris 11.1 support repository update (SRU18, released April 14, 2014), and the Solaris 11.2 beta released today.

As with last time, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

Package | Upstream | 11.1 | 11.1 SRU18 | 11.2 Beta
archiver/gnu-tar | GNU tar | 1.26 | 1.26 | 1.27.1
archiver/unrar | UnRAR | 4.1.4 | 4.1.4 | 4.2.4
cloud/openstack/cinder | OpenStack | not included | not included | 0.2013.1.4
cloud/openstack/glance | OpenStack | not included | not included | 0.2013.1.4
cloud/openstack/horizon | OpenStack | not included | not included | 0.2013.1.4
cloud/openstack/keystone | OpenStack | not included | not included | 0.2013.1.4
cloud/openstack/neutron | OpenStack | not included | not included | 0.2013.1.4
cloud/openstack/nova | OpenStack | not included | not included | 0.2013.1.4
communication/im/pidgin | pidgin | 2.10.5 | 2.10.5 | 2.10.9
compress/gzip | GNU gzip | 1.4 | 1.5 | 1.5
compress/pbzip2 | Parallel bzip2 | not included | not included | 1.1.6
compress/pixz | pixz | not included | not included | 1.0
crypto/gnupg | GnuPG | 2.0.17 | 2.0.17 | 2.0.22
database/berkeleydb-5 | Oracle Berkeley DB | 5.1.25 | 5.1.25 | 5.3.21
database/mysql-55 | MySQL | not included | not included | 5.5.31
database/sqlite-3 | SQLite | 3.7.11 | 3.7.14.1 | 3.7.14.1
desktop/window-manager/twm | X.Org | 1.0.7 | 1.0.7 | 1.0.8
developer/build/ant | Apache Ant | 1.7.1 | 1.8.4 | 1.8.4
developer/build/autoconf/xorg-macros | X.Org | 1.17 | 1.17 | 1.17.1
developer/build/imake | X.Org | 1.0.5 | 1.0.5 | 1.0.6
developer/build/makedepend | X.Org | 1.0.4 | 1.0.4 | 1.0.5
developer/debug/gdb | GNU GDB | 6.8 | 6.8 | 7.6
developer/gcc-47 | GNU Compiler Collection | not included | not included | 4.7.3
developer/gcc-48 | GNU Compiler Collection | not included | not included | 4.8.2
developer/gnu-indent | GNU indent | not included | not included | 2.2.9
developer/java/jdk-6 | Java | 1.6.0.35 | 1.6.0.75 | 1.6.0.75
developer/java/jdk-7 | Java | 1.7.0.7 | 1.7.0.55.13 | 1.7.0.55.13
developer/java/jdk-8 | Java | not included | not included | 1.8.0.5.13
developer/java/junit | JUnit | 4.10 | 4.10 | 4.11
developer/javascript/jsl | JavaScript Lint | not included | not included | 0.3.0
developer/python/pylint | pylint | 0.18.0 | 0.18.0 | 0.25.2
developer/versioning/mercurial | Mercurial SCM | 2.2.1 | 2.2.1 | 2.8.2
diagnostic/nmap | nmap | 5.51 | 6.25 | 6.25
diagnostic/numatop | numatop | not included | not included | 1.0
diagnostic/scanpci | X.Org | 0.13.1 | 0.13.1 | 0.13.2
diagnostic/tcpdump | tcpdump | 4.1.1 | 4.5.1 | 4.5.1
diagnostic/wireshark | Wireshark | 1.8.2 | 1.8.12 | 1.10.6
diagnostic/xload | X.Org | 1.1.1 | 1.1.1 | 1.1.2
document/viewer/xditview | X.Org | 1.0.2 | 1.0.2 | 1.0.3
driver/graphics/nvidia | NVIDIA | 0.295.20.0 | 0.295.20.0 | 0.331.38.0
editor/gnu-emacs | GNU Emacs | 23.4 | 23.4 | 24.3
editor/xedit | X.Org | 1.2.0 | 1.2.0 | 1.2.1
file/gnu-coreutils | GNU Coreutils | 8.5 | 8.5 | 8.16
file/mc | GNU Midnight Commander | 4.7.5.2 | 4.7.5.2 | 4.8.8
file/mlocate | mlocate | not included | not included | 0.25
file/slocate | | 3.1 | 3.1 | not included
image/editor/bitmap | X.Org | 1.0.6 | 1.0.6 | 1.0.7
image/imagemagick | ImageMagick | 6.3.4.2 | 6.8.3.5 | 6.8.3.5
library/cacao | Common Agent Container | 2.3.1.0 | 2.4.2.0 | 2.4.2.0
library/graphics/pixman | X.Org | 0.24.4 | 0.24.4 | 0.29.2
library/libarchive | libarchive | not included | not included | 3.0.4
library/libmilter | Sendmail | 8.14.5 | 8.14.7 | 8.14.7
library/libxml2 | XML C parser | 2.7.6 | 2.7.6 | 2.9.1
library/libxslt | libxslt | 1.1.26 | 1.1.26 | 1.1.28
library/neon | neon | 0.29.5 | 0.29.5 | 0.29.6
library/perl-5/perl-x11-protocol | CPAN: X11-Protocol | not included | not included | 0.56
library/perl-5/xml-libxml | CPAN: XML::LibXML | not included | 2.14 | 2.14
library/perl-5/xml-namespacesupport | CPAN: XML::NamespaceSupport | not included | 1.11 | 1.11
library/perl-5/xml-parser-threaded-512 | CPAN: XML::Parser | not included | 2.36 | 2.36
library/perl-5/xml-sax | CPAN: XML::SAX | not included | 0.99 | 0.99
library/perl-5/xml-sax-base | CPAN: XML::SAX::Base | not included | 1.08 | 1.08
library/perl5/perl-tk | CPAN: Tk | not included | not included | 804.31
library/python-2/alembic | alembic | not included | not included | 0.6.0
library/python-2/amqp | amqp | not included | not included | 1.0.12
library/python-2/anyjson | anyjson | not included | not included | 0.3.3
library/python-2/argparse | argparse | not included | 1.2.1 | 1.2.1
library/python-2/babel | babel | not included | not included | 1.3
library/python-2/beautifulsoup4 | beautifulsoup4 | not included | not included | 4.2.1
library/python-2/boto | boto | not included | not included | 2.9.9
library/python-2/cheetah | cheetah | not included | not included | 2.4.4
library/python-2/cliff | cliff | not included | not included | 1.4.5
library/python-2/cmd2 | cmd2 | not included | not included | 0.6.7
library/python-2/cov-core | cov-core | not included | not included | 1.7
library/python-2/cssutils | cssutils | not included | not included | 0.9.6
library/python-2/d2to1 | d2to1 | not included | not included | 0.2.10
library/python-2/decorator | decorator | not included | not included | 3.4.0
library/python-2/django | django | not included | not included | 1.4.10
library/python-2/django-appconf | django-appconf | not included | not included | 0.6
library/python-2/django_compressor | django_compressor | not included | not included | 1.3
library/python-2/django_openstack_auth | OpenStack | not included | not included | 1.1.3
library/python-2/eventlet | eventlet | not included | not included | 0.13.0
library/python-2/filechunkio | filechunkio | not included | not included | 1.5
library/python-2/formencode | formencode | not included | not included | 1.2.6
library/python-2/greenlet | greenlet | not included | not included | 0.4.1
library/python-2/httplib2 | httplib2 | not included | not included | 0.8
library/python-2/importlib | importlib | not included | not included | 1.0.2
library/python-2/ipython | ipython | not included | not included | 0.10
library/python-2/iso8601 | iso8601 | not included | not included | 0.1.4
library/python-2/jsonpatch | jsonpatch | not included | not included | 1.1
library/python-2/jsonpointer | jsonpointer | not included | not included | 1.0
library/python-2/jsonschema | jsonschema | not included | not included | 2.0.0
library/python-2/kombu | kombu | not included | not included | 2.5.12
library/python-2/lesscpy | lesscpy | not included | not included | 0.9.10
library/python-2/librabbitmq | librabbitmq | not included | not included | 1.0.1
library/python-2/libxml2-26 | libxml2 | 2.7.6 | 2.7.6 | 2.9.1
library/python-2/libxml2-27 | libxml2 | 2.7.6 | 2.7.6 | 2.9.1
library/python-2/libxsl-26 | libxsl | 1.1.26 | 1.1.26 | 1.1.28
library/python-2/libxsl-27 | libxsl | 1.1.26 | 1.1.26 | 1.1.28
library/python-2/lockfile | lockfile | not included | not included | 0.9.1
library/python-2/logilab-astng | logilab-astng | 0.19.0 | 0.19.0 | 0.24.0
library/python-2/logilab-common | logilab-common | 0.40.0 | 0.40.0 | 0.58.2
library/python-2/markdown | markdown | not included | not included | 2.3.1
library/python-2/markupsafe | markupsafe | not included | not included | 0.18
library/python-2/mock | mock | not included | not included | 1.0.1
library/python-2/netaddr | netaddr | not included | not included | 0.7.10
library/python-2/netifaces | netifaces | not included | not included | 0.8
library/python-2/nose | nose | 1.1.2 | 1.1.2 | 1.2.1
library/python-2/nose-cover3 | nose-cover3 | not included | not included | 0.0.4
library/python-2/ordereddict | ordereddict | not included | not included | 1.1
library/python-2/oslo.config | oslo.config | not included | not included | 1.2.1
library/python-2/passlib | passlib | not included | not included | 1.6.1
library/python-2/paste | paste | not included | not included | 1.7.5.1
library/python-2/paste.deploy | paste.deploy | not included | not included | 1.5.0
library/python-2/pbr | pbr | not included | not included | 0.5.21
library/python-2/pep8 | pep8 | not included | not included | 1.4.4
library/python-2/pip | pip | not included | not included | 1.4.1
library/python-2/prettytable | prettytable | not included | not included | 0.7.2
library/python-2/pypy | | not included | not included | 1.4.15
library/python-2/pyasn1 | pyasn1 | not included | not included | 0.1.7
library/python-2/pyasn1-modules | pyasn1-modules | not included | not included | 0.0.5
library/python-2/pycountry | pycountry | not included | not included | 0.17
library/python-2/pydns | pydns | not included | not included | 2.3.6
library/python-2/pyflakes | pyflakes | not included | not included | 0.7.2
library/python-2/pygments | pygments | not included | not included | 1.6
library/python-2/pyopenssl | pyopenssl | 0.11 | 0.11 | 0.13
library/python-2/pyparsing | pyparsing | not included | not included | 2.0.1
library/python-2/pyrabbit | pyrabbit | not included | not included | 1.0.1
library/python-2/pytest | pytest | not included | not included | 2.3.5
library/python-2/pytest-capturelog | pytest-capturelog | not included | not included | 0.7
library/python-2/pytest-codecheckers | pytest-codecheckers | not included | not included | 0.2
library/python-2/pytest-cov | pytest-cov | not included | not included | 1.6
library/python-2/python-dbus-26 | D-Bus | 0.83.2 | 0.83.2 | 1.1.1
library/python-2/python-imaging | python-imaging | not included | not included | 1.1.7
library/python-2/python-ldap | python-ldap | not included | not included | 2.4.10
library/python-2/python-mysql | python-mysql | not included | not included | 1.2.2
library/python-2/python-zope-interface | Zope | not included | not included | 3.3.0
library/python-2/pytz | pytz | not included | not included | 2013.4
library/python-2/repoze.lru | repoze.lru | not included | not included | 0.6
library/python-2/requests | requests | not included | not included | 1.2.3
library/python-2/routes | routes | not included | not included | 1.13
library/python-2/setuptools-git | setuptools-git | not included | not included | 1.0
library/python-2/simplejson | simplejson | not included | not included | 2.1.2
library/python-2/six | six | not included | not included | 1.4.1
library/python-2/sqlalchemy | sqlalchemy | not included | not included | 0.7.9
library/python-2/sqlalchemy-migrate | sqlalchemy-migrate | not included | not included | 0.7.2
library/python-2/stevedore | stevedore | not included | not included | 0.10
library/python-2/suds | suds | not included | not included | 0.4
library/python-2/tempita | tempita | not included | not included | 0.5.1
library/python-2/tox | tox | not included | not included | 1.4.3
library/python-2/unittest2 | unittest2 | not included | not included | 0.5.1
library/python-2/virtualenv | virtualenv | not included | not included | 1.9.1
library/python-2/waitress | waitress | not included | not included | 0.8.5
library/python-2/warlock | warlock | not included | not included | 1.0.1
library/python-2/webobwebobnot includednot included1.2.3
library/python-2/websockifywebsockifynot includednot included0.3.0
library/python-2/webtestWebTestnot includednot included2.0.6
library/python/cinderclientOpenStacknot includednot included1.0.7
library/python/glanceclientOpenStacknot includednot included0.12.0
library/python/keystoneclientOpenStacknot includednot included0.4.1
library/python/neutronclientOpenStacknot includednot included2.3.1
library/python/novaclientOpenStacknot includednot included2.15.0
library/python/quantumclientOpenStacknot includednot included2.2.4.3
library/python/swiftclientOpenStacknot includednot included2.0.2
library/security/libgpg-errorGnuPG1.101.121.12
library/security/opensslOpenSSL1.0.0.10 (1.0.0j)1.0.0.11 (1.0.0k)1.0.1.7 (1.0.1g)
library/security/openssl/openssl-fips-140OpenSSL1.21.22.0.6
mail/fetchmailfetchmail6.3.216.3.226.3.22
mail/thunderbirdMozilla Thunderbird10.0.61717.0.6
mail/thunderbird/plugin/thunderbird-lightningMozilla Lightning10.0.61717.0.6
media/cdrtoolsCDrecord3.03.03.1
network/amqp/rabbitmqRabbitMQnot includednot included3.1.3
network/dns/bindISC BIND9.6.3.7.2 (9.6-ESV-R7-P2)9.6.3.10.2 (9.6-ESV-R10-P2)9.6.3.10.2 (9.6-ESV-R10-P2)
network/rsyncrsync3.0.83.0.83.0.9
package/pkgbuildpkgbuild1.3.1041.3.1041.3.105
print/filter/hplipHPLIP3.10.93.10.93.12.4
runtime/clispGNU CLISP2.472.472.49
runtime/erlangErlang12.2.512.2.515.2.3
runtime/java/jre-6Java1.6.0.351.6.0.751.6.0.75
runtime/java/jre-7Java1.7.0.71.7.0.55.131.7.0.55.13
runtime/java/jre-8Javanot includednot included1.8.0.5.13
runtime/perl-512Perl5.12.45.12.55.12.5
runtime/perl-threaded-512Perlnot included5.12.55.12.5
runtime/ruby-18Ruby1.8.7.3571.8.7.3741.8.7.374
runtime/ruby-19Rubynot includednot included1.9.3.484
runtime/ruby-19/ruby-tkRubynot includednot included1.9.3.484
runtime/tcl-8Tcl/Tk8.5.98.5.98.5.12
runtime/tcl-8/tcl-sqlite-33.7.113.7.14.13.7.14.1
runtime/tk-8Tcl/Tk8.5.98.5.98.5.12
security/compliance/openscapOpenSCAP0.8.10.8.11.0.0
security/sudoSudo1.8.4.51.8.6.71.8.6.7
service/memcachedMemcached1.4.51.4.171.4.17
service/network/dhcp/isc-dhcpISC DHCP4.1.0.64.1.0.74.1.0.7
service/network/dns/bindISC BIND9.6.3.7.2 (9.6-ESV-R7-P2)9.6.3.10.2 (9.6-ESV-R10-P2)9.6.3.10.2 (9.6-ESV-R10-P2)
service/network/dnsmasqDnsmasqnot includednot included2.68
service/network/ftpProFTPD1.3.3.0.7 (1.3.3g)1.3.4.0.3 (1.3.4c)1.3.4.0.3 (1.3.4c)
service/network/ntpNTP4.2.5.200 (4.2.5p200)4.2.7.381 (4.2.7p381)4.2.7.381 (4.2.7p381)
service/network/ptpPTPdnot includednot included2.2.0
service/network/sambaSamba3.6.63.6.233.6.23
service/network/smtp/sendmailSendmail8.14.58.14.78.14.7
service/security/stunnelstunnel4.294.294.56
shell/gnu-getoptGNU getoptnot includednot included1.1.5
shell/parallelGNU parallelnot includednot included0.2012.11.22
shell/tcshtcsh6.17.06.18.16.18.1
shell/zshZsh4.3.174.3.175.0.5
system/library/dbusD-Bus1.2.281.2.281.7.1
system/library/freetype-2FreeType2.4.92.4.112.4.11
system/library/hmp-libsHMPnot includednot included2.2.8
system/library/libdbusD-Bus1.2.281.2.281.7.1
system/library/libdbus-glibD-Bus0.880.880.100
system/library/libpcaptcpdump1.1.11.5.11.5.1
system/library/security/libgcryptGNU libgcrypt1.4.51.5.31.5.3
system/management/biosconfigHMPnot includednot included2.2.8
system/management/facterPuppetnot includednot included1.6.18
system/management/fwupdateHMPnot includednot included2.2.8
system/management/fwupdate/emulexHMPnot includednot included6.3.12.2
system/management/fwupdate/qlogicHMPnot includednot included1.7.3
system/management/hmp-snmpHMPnot includednot included2.2.8
system/management/hwmgmtcliHMPnot includednot included2.2.8
system/management/hwmgmtdHMPnot includednot included2.2.8
system/management/ipmitoolipmitool1.8.111.8.111.8.12
system/management/puppetPuppetnot includednot included3.4.1
system/management/raidconfigHMPnot includednot included2.2.8
system/management/ubiosconfigHMPnot includednot included2.2.8
system/storage/sg3_utilssg3_utils1.281.281.33
system/test/sunvts7.0.147.17.17.18.0
terminal/csshCluster SSHnot includednot included4.2.1
terminal/tmuxtmuxnot includednot included1.8
text/gnu-grepGNU grep2.102.142.14
text/texinfoGNU texinfo4.74.134.13
web/browser/firefoxMozilla Firefox10.0.61717.0.6
web/java-servlet/tomcatApache Tomcat6.0.356.0.376.0.39
web/php-53PHP5.3.145.3.275.3.28
web/php-53/extension/php-zendopcacheZend OPcachenot includednot included7.0.2
web/proxy/squidsquid3.1.183.1.233.1.23
web/server/apache-22Apache HTTPD2.2.222.2.252.2.27
web/server/apache-22/module/apache-fcgidApache FastCGI2.3.62.3.92.3.9
web/server/apache-22/module/apache-php53PHP5.3.145.3.275.3.28
web/server/apache-22/module/apache-securityModSecurity2.5.92.5.92.7.5
web/server/apache-22/module/apache-sedApache HTTPD2.2.222.2.222.2.27
web/server/lighttpd-14Lighttpd1.4.231.4.331.4.35
web/wgetGNU wget1.121.121.14
x11/data/xcursor-themesX.Org1.0.31.0.31.0.4
x11/demo/mesa-demosMesa 3-D8.0.18.0.18.1.0
x11/diagnostic/intel-gpu-toolsX.Orgnot includednot included1.3
x11/diagnostic/xevX.Org1.2.01.2.01.2.1
x11/diagnostic/xscopeX.Org1.3.11.3.11.4
x11/library/libdmxX.Org1.1.21.1.21.1.3
x11/library/libdrmDRI2.4.322.4.322.4.43
x11/library/libfontencX.Org1.1.11.1.11.1.2
x11/library/libfsX.Org1.0.41.0.41.0.5
x11/library/libsmX.Org1.2.11.2.11.2.2
x11/library/libx11X.Org1.5.01.5.01.6.2
x11/library/libxauX.Org1.0.71.0.71.0.8
x11/library/libxcbXCB1.8.11.8.11.9.1
x11/library/libxcompositeX.Org0.4.30.4.30.4.4
x11/library/libxcursorX.Org1.1.131.1.131.1.14
x11/library/libxdamageX.Org1.1.31.1.31.1.4
x11/library/libxextX.Org1.3.11.3.11.3.2
x11/library/libxfixesX.Org5.05.05.0.1
x11/library/libxfontX.Org1.4.51.4.51.4.7
x11/library/libxiX.Org1.6.11.6.11.7.2
x11/library/libxineramaX.Org1.1.21.1.21.1.3
x11/library/libxmuX.Org1.1.11.1.11.1.2
x11/library/libxmuuX.Org1.1.11.1.11.1.2
x11/library/libxpX.Org1.0.11.0.11.0.2
x11/library/libxpmX.Org3.5.103.5.103.5.11
x11/library/libxrandrX.Org1.3.21.3.21.4.2
x11/library/libxrenderX.Org0.9.70.9.70.9.8
x11/library/libxresX.Org1.0.61.0.61.0.7
x11/library/libxtstX.Org1.2.11.2.11.2.2
x11/library/libxvX.Org1.0.71.0.71.0.10
x11/library/libxvmcX.Org1.0.71.0.71.0.8
x11/library/libxxf86vmX.Org1.1.21.1.21.1.3
x11/library/mesaMesa 3-D7.11.27.11.29.0.3
x11/library/toolkit/libxaw7X.Org1.0.111.0.111.0.12
x11/library/toolkit/libxtX.Org1.1.31.1.31.1.4
x11/library/xcb-utilXCB0.3.80.3.80.3.9
x11/server/xorgX.Org1.12.21.12.21.14.5
x11/server/xorg/driver/xorg-input-keyboardX.Org1.6.11.6.11.7.0
x11/server/xorg/driver/xorg-input-mouseX.Org1.7.21.7.21.9.0
x11/server/xorg/driver/xorg-input-synapticsX.Org1.6.21.6.21.7.1
x11/server/xorg/driver/xorg-input-vmmouseX.Org12.8.012.8.013.0.0
x11/server/xorg/driver/xorg-video-astX.Org0.93.100.93.100.97.0
x11/server/xorg/driver/xorg-video-atiX.Org6.14.46.14.46.14.6
x11/server/xorg/driver/xorg-video-cirrusX.Org1.4.01.4.01.5.2
x11/server/xorg/driver/xorg-video-dummyX.Org0.3.50.3.50.3.6
x11/server/xorg/driver/xorg-video-intelX.Org2.18.02.18.02.21.5
x11/server/xorg/driver/xorg-video-mach64X.Org6.9.16.9.16.9.4
x11/server/xorg/driver/xorg-video-mgaX.Org1.5.01.5.01.6.2
x11/server/xorg/driver/xorg-video-nvX.Org2.1.182.1.18not included
x11/server/xorg/driver/xorg-video-openchromeX.Org0.2.9050.2.9050.3.2
x11/server/xorg/driver/xorg-video-r128X.Org6.8.26.8.26.8.4
x11/server/xorg/driver/xorg-video-tridentX.Org1.3.51.3.5not included
x11/server/xorg/driver/xorg-video-vesaX.Org2.3.12.3.12.3.2
x11/server/xorg/driver/xorg-video-vmwareX.Org12.0.212.0.213.0.1
x11/session/sessregX.Org1.0.71.0.71.0.8
x11/session/xinitX.Org1.3.21.3.21.3.3
x11/transsetX.Org1.0.01.0.01.0.1
x11/x11-window-dumpX.Org1.0.51.0.51.0.6
x11/xcalcX.Org1.0.4.11.0.4.11.0.5
x11/xclipboardX.Org1.1.21.1.21.1.3
x11/xclockX.Org1.0.61.0.61.0.7
x11/xconsoleX.Org1.0.41.0.41.0.6
x11/xfdX.Org1.1.11.1.11.1.2
x11/xfontselX.Org1.0.41.0.41.0.5
x11/xfsX.Org1.1.21.1.31.1.3
x11/xkillX.Org1.0.31.0.31.0.4
x11/xmagX.Org1.0.41.0.41.0.5
x11/xmanX.Org1.1.21.1.21.1.3
x11/xvidtuneX.Org1.0.21.0.21.0.3
April 27, 2014

Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and then has some core X server performance improvements as well for filled and outlined arcs.

Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn’t quite measuring what we thought. I’m not interested in competing on x11perf numbers absolutely, I’m only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap, and it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To better exercise the rectangle code and check edge conditions, we added a one pixel gap between the rectangles. However, we didn’t reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result.

EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then take the rectangles and clip them against the window clip list by computing a region from them and intersecting that with the GC composite clip. It’s a completely reasonable plan, however, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle. Which is surprisingly fast, compared with drawing individual lines.

I “fixed” the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.

I’ve pushed out these changes to my x11perf repository on freedesktop.org.

What’s Fast

Things that match GL’s capabilities are fast; things which don’t are slow. No surprises there. What’s interesting is precisely what matches GL.

Patterns For Free

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.

The other implicit requirement is that all zero width lines look the same. Right now, I’ve solved that for fill styles and raster ops as they’re all drawn with the same GL operations. However, for plane masks, we’re currently falling back to software, which may draw something different. Fixing that isn’t impossible, it’s just tedious.

Text

Pushing all of the work of drawing core text into glamor wasn’t terribly difficult, and the results are pretty spectacular.

What’s Slow

We’ve still got room for improvement in Glamor, but there aren’t any obvious show-stoppers to getting great performance for reasonable X applications anymore.

Wide Lines and Arcs

One of the speed-ups I’ve made in my glamor branch is to merge the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out both to improve performance and to save a bit of code. Right now, drawing wide lines and wide arcs doesn’t do this, so we suffer from submitting many smaller requests to GL. It’s hard to get excited about speeding any of this up, as all of the wide primitives are essentially unused these days.

Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can’t really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren’t that useful.

Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We’ve got a GSoC student ready to go at this though, so I expect we’ll have much better numbers in a few months.

Window Operations

You wouldn’t think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you’d expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blits of any form. There’s the NV_texture_barrier extension, which at least lets us do blits within the same object, but even that doesn’t define how overlapping blits work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.

Results

Here’s a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.

April 25, 2014

Today I released AppStream and libappstream 0.6.1, which feature mostly bugfixes, so nothing incredibly exciting to see there (but this also means no API/ABI breaks). The release clarifies some paragraphs in the spec which people found confusing, and fixes a few issues (like one example in the docs not being valid AppStream metadata). The only spec extension is a new “priority” property in distro metadata, which allows metadata from one repository to override data shipped by another one. This is already used (although with a similar syntax) in Fedora to have “placeholder” data for non-free stuff, which gets overridden by the real metadata if a new application is added. In general, the property was added to make the answer to the question “which data is preferred” much less magic.

The libappstream library got some new API to query component data in different ways, and I also brought back support for Vala (so if you missed the Vapi file: It’s back now, although you have to manually enable this feature).

The CLI tool also got some extensions to query AppStream data. Here is a brief introduction:

First of all, we need to make sure the database is up-to-date, which should be the case already (it is rebuilt automatically):

$ sudo appstream-index refresh

The database will only be rebuilt when necessary; if you want to force a rebuild anyway, use the “--force” parameter.

Now imagine we want to search for an app containing the word “media” (in description, keywords, summary, …):

$ appstream-index s media

which will return:

Identifier: gnome-media-player.desktop [desktop-app]
Name: GNOME Media Player
Summary: A simple media player for GNOME
Package: gnome-media-player
----
Identifier: mediaplayer-app.desktop [desktop-app]
Name: Media Player
Summary: Media Player
Package: mediaplayer-app
----
Identifier: kde4__plasma-mediacenter.desktop [desktop-app]
Name: Plasma Media Center
Summary: A mediacenter user interface written with the Plasma framework
Package: plasma-mediacenter
----
etc.

If we already know the name of a .desktop file or the ID of a component, we can have the tool print out information about the application, including which package it was installed from:

$ appstream-index get lyx.desktop

If we want to see more details, including e.g. a screenshot URL and a longer description, we can pass “--details” to the tool:

Identifier: lyx.desktop [desktop-app]
Name: LyX
Summary: An advanced document processor with the power of LaTeX.
Package: lyx-common
Homepage: http://www.lyx.org/
Icon: lyx.png
Description: LyX is a document processor that encourages an approach to writing
 based on the structure of your documents (WYSIWYM) and not simply
 their appearance (WYSIWYG).
 
 LyX combines the power and flexibility of TeX/LaTeX[...]
Sample Screenshot URL: http://alt.fedoraproject.org/pub/alt/screenshots/f21/source/lyx-ea535ddf18b5c7328c5e88d2cd2cbd8c.png
License: GPLv2+

(I truncated the results slightly ;-) )

Okay, so far so good. But now it gets really exciting (and this is a feature added with 0.6.1): We can now query a component by the items it provides. For example, I want to know which software provides the library libfoo.so.2:

$ appstream-index what-provides lib libfoo.so.2

This also works with binaries, or Python modules:

$ appstream-index what-provides bin apper

This works distribution-agnostic, as long as the software ships upstream metadata with a valid <provides/> field, or the distributor adds it while generating the AppStream distro metadata.
This means that software can – as soon as we have sufficient metadata of this kind – declare its dependencies upstream in the form of a simple text file, referencing the components needed to build and run it on any Linux distribution. Users can simply install missing stuff by passing that file to their package manager, which can look up the component->package mapping and versions and do the right thing in installing the dependencies. So basically, this allows the kind of thing “pip -r” does for Python, but for any application (not only Python stuff), and based on the distributor’s package database.

With the provides items, we can also scan software to detect its dependencies automatically (and have them in a distro-agnostic form directly). We can also easily search for missing mimetype handlers, missing kernel modules, missing firmware etc. to install them on demand, making the system much smarter in handling its dependencies. And users don’t need to do search orgies to find the right component for a given task.

Also on my todo list for the future, based on this feature: A small tool telling upstream authors which distribution has their application in which version, using just one command (and AppStream data from multiple distros).

Also planned: A cross-distro information page showing which distros ship which library versions, Python modules and application versions (and also the support status of the distro), so developers know which library versions (or GCC versions etc.) they should at least support to make their application easily available on most distributions.

As always, you can get the releases on Freedesktop, as well as the AppStream specification.

April 17, 2014
Under that name is a simple idea: making it easier to save, load, update and query objects in an object store.

I'm not the main developer for this piece of code, but contributed a large number of fixes to it, while porting a piece of code to it as a test of the API. Much of the credit for the design of this very useful library goes to Christian Hergert.

The problem

It's possible that you've already implemented a data store inside your application, hiding your complicated SQL queries in a separate file because they contain injection security issues. Or you've used the filesystem as the store and threw away the ability to search particular fields without loading everything in memory first.

Given that SQLite pretty much matches our use case - it offers good search performance, it's a popular thus well-documented project and its files can be manipulated through a number of first-party and third-party tools - wrapping its API to make it easier to use is probably the right solution.

The GOM solution

GOM is a GObject based wrapper around SQLite. It will hide SQL from you, but still allow you to call to it if you have a specific query you want to run. It will also make sure that SQLite queries don't block your main thread, which is pretty useful indeed for UI applications.

For each table, you would have a GObject, a subclass of GomResource, representing a row in that table. Each column is a property on the object. To add a new item to the table, you would simply do:

item = g_object_new (ITEM_TYPE_RESOURCE,
                     "column1", value1,
                     "column2", value2,
                     NULL);
gom_resource_save_sync (item, NULL);
We have a number of features which try to make it as easy as possible for application developers to use gom, such as:
  • Automatic table creation for string, string arrays, and number types as well as GDateTime, and transformation support for complex types (say, colours or images).
  • Automatic database version migration, using annotations on the properties ("new in version")
  • Programmatic API for queries, including deferred fetches for results
Currently, the main cost in terms of lines of code, when porting from SQLite, is the verbosity of declaring properties with GObject. That will hopefully be fixed by the GProperty work planned for the next GLib release.

The future

I'm currently working on some missing features to support a port of the grilo bookmarks plugin (support for column REFERENCES).

I will also be making (small) changes to the API to allow changing the backend from SQLite to another one, such as XML, or a binary format. Obviously the SQL "escape hatches" wouldn't be available with those backends.

Don't hesitate to file bugs if there are any problems with the API, or its documentation, especially with respect to porting from applications already using SQLite directly. Or if there are bugs (surely, no).

Note that JavaScript support isn't ready yet, due to limitations in gjs.

¹: « SQLite don't hurt me, don't hurt me, no more »
April 16, 2014

Things are moving forward for the Fedora Workstation project. For those of you who don’t know about it, it is part of a broader plan to refocus Fedora around 3 core products, with a clear and distinctive use case for each. The goal here is to have a clear definition of what Fedora is, and something that, for instance, ISVs can clearly identify and target with their products. At the same time it is trying to move away from the traditional distribution model, a model where you primarily take whatever comes your way from upstream, apply a little duct tape to try to keep things together, and ship it. That model was good in the early years of Linux’s existence, but it does not seem a great fit for what people want from an operating system today.

If we look at successful products like Mac OS X, PlayStation 4, Android and Chrome OS, the common thread between them is that while they all were built on top of existing open source efforts, they didn’t just indiscriminately shovel in any open source code and project they could find. Instead they decided upon the product they wanted to make and then cherry-picked the pieces out there that could help them with that, developing themselves what they couldn’t find perfect fits for. The same is to some degree true for things like Red Hat Enterprise Linux and Ubuntu. Both products, while based almost solely on existing open source components, have cherry-picked what they wanted and then developed the pieces they needed on top of them. For instance, for Red Hat Enterprise Linux the custom kernel has always been part of the value add offered: a Linux kernel with a core set of dependable APIs.

Fedora on the other hand has historically followed a path more akin to Debian’s, with a ‘more the merrier’ attitude, trying to welcome anything into the group. A metaphor often used in the Fedora community to describe this state was that Fedora was like a collection of Lego blocks: if you had the time and the interest, you could build almost anything with it. The problem with this state was that the products you built also ended up feeling like the creations you make with a random box of Lego blocks: a lot of pointy edges and some weird looking sections, due to needing to work around the pieces you had available as opposed to using the pieces most suited.

With the 3 products we are switching to a model where, although we start with that big box of Lego blocks, we add some engineering capacity on top of it, make some clear and hard decisions on direction, and actually start creating something that looks and feels like it was made to be a whole instead of just assembled from a random set of pieces. So when we are planning the Fedora Workstation we are not just looking at what features we can develop for individual libraries or applications like GTK+, Firefox or LibreOffice; we are looking at what we want the system as a whole to look like. And maybe most importantly, we try our hardest to look at things from a feature/use-case viewpoint first, as opposed to a specific technology viewpoint. So instead of asking ‘what features are there in systemd that we can expose/use in the desktop?’, the question instead becomes ‘what new features do we want to offer our users in future versions of the product, and what do we need from systemd, the kernel and others to be able to do that?’.

So while technologies such as systemd, Wayland, Docker and btrfs are on our roadmap, they are not there because they are ‘cool technologies’; they are there because they provide us with the infrastructure we need to achieve our feature goals. And what’s more, we make sure to work closely with the core developers to make the technologies what we need them to be. This means for example that between myself and other members of the team we are having regular conversations with people such as Kristian Høgsberg and Lennart Poettering, and of course contributing code where possible.

To explain our mindset with the Fedora Workstation effort, let me quickly summarize some old history. In 2001 Jim Gettys, one of the original creators of the X Window System, gave a talk at GUADEC in Seville called ‘Draining the Swamp’. I don’t think the talk can be found online anywhere, but he outlined some of the same thoughts in this email reply to Richard Stallman some time later. I think that presentation has shaped the thinking of the people who saw it ever since; I know it has shaped mine. Jim’s core message was that the idea that we can create a great desktop system by trying to work around the shortcomings or weirdness in the rest of the operating system is a total fallacy. If we look at the operating system as a collection of 100% independent parts, all developing at their own pace and with their own agendas, we will never be able to create a truly great user experience on the desktop. Instead we need to work across the stack, fixing the issues we see where they should be fixed, and through that ‘drain the swamp’. If we instead keep trying to solve the problems by adding layers upon layers of workarounds and abstraction layers, we will be growing the swamp, making it even more unmanageable. We are trying to bring that ‘draining the swamp’ mindset with us into creating the Fedora Workstation product.

With that in mind, what are the driving ideas behind the Fedora Workstation? The Fedora Workstation effort is meant to provide a first class desktop for your laptop or workstation computer, combining a polished user interface with access to new technologies. We are putting a special emphasis on developers with our first releases, both looking at how we improve the desktop experience for developers, and looking at what tools we can offer developers to let them be productive as quickly as possible. And to be clear, when we say developers we are not only thinking about developers who want to develop for the desktop or the desktop itself, but any kind of software developer or DevOps person out there.

The full description of the Fedora Workstation can be found here, but the essence of our plan is to create a desktop system that not only provides some incremental improvements over how things are done today, but which truly tries to take a fresh look at how a Linux desktop operating system should operate. The traditional distribution model, built up around software packages like RPM or Deb, has both its pluses and minuses.
Its biggest challenge is probably that it creates a series of fiefdoms where 3rd party developers can’t easily target the system, or a family of systems, except by spending time very specifically supporting each one. And even once a developer decides to commit to trying to support a given system, it is not clear what system services they can depend on always being available, or what human interface design they should aim for. Solving these kinds of issues is part of our agenda for the new workstation.

So to achieve this we have decided on a set of core technologies to build this solution upon. The central piece of the puzzle is the so-called LinuxApps proposal from Lennart Poettering. LinuxApps is currently a combination of high level ideas and some concrete building blocks. The building blocks are technologies such as Wayland, kdbus, overlayfs and software containers. The ideas side includes developing a permission system, similar to what you see Android applications employ, to decide what rights a given application has, and developing defined, versioned library bundles that 3rd party applications can depend on regardless of the version of the operating system. On the container side we plan on expanding on the work Red Hat is doing with Docker and Project Atomic.

In terms of some of the other building blocks, I think most of you already know of the big push we are making to get the new Wayland display server ready. This includes work on core infrastructure like libinput, a new library for handling input devices being developed by Jonas Ådahl and our own Peter Hutterer. There is also a lot of work happening on the GNOME 3 side of things to make GNOME 3 Wayland-ready. Jasper St. Pierre wrote up a great blog entry outlining his work to make GDM and the GNOME Shell work better with Wayland. It is an ongoing effort, but there is a big community around it, as most recently seen at the West Coast Hackfest at the Endless Mobile office.

As I mentioned, there is a special emphasis on developers for the initial releases. This includes both small and big changes. For instance, we decided to put some time into improving the GNOME Terminal application, as we know it is a crucial piece of technology for developers and system administrators alike. Some of the terminal improvements can be seen in GNOME 3.12, but we have more features lined up for the terminal, including the return of translucency. We are also looking at the tools provided in general, and the great thing here is that we are able to build upon a lot of efforts that Red Hat is developing for its product portfolio, like Software Collections, which gives easy access to a wide range of development tools and environments. Together with Developer Assistant this should greatly enhance your developer experience in the Fedora Workstation. The inclusion of Software Collections also means that Fedora becomes an even better tool for developing software that you expect to deploy on RHEL: an identical software collection will be available on RHEL, so you can be sure the exact same toolchain and toolchain versions you have been developing against on Fedora are available on both systems.

Of course creating a great operating system isn’t just about the applications and shell, but also about supporting the kind of hardware people want to use. A good example here is the effort we put into HiDPI support. HiDPI screens are not very common yet, but a lot of the new high end laptops coming out are already using them. Anyone who has used something like a Google Pixel or a Samsung Ativ Book 9 Plus has quickly come to appreciate the improved sharpness and image quality these displays bring. Thanks to that effort, I have been very pleased to see many recent GNOME 3.12 reviews mention this work and say that GNOME 3.12 is currently the best Linux desktop for use with HiDPI systems.

Another part of the puzzle for creating a better operating system is software installation. The traditional distribution model often tended to bundle as many applications as possible, since there was no good way for users to discover new software for their system. This is a brute force approach that assumes that if you checked the ‘scientific researcher’ checkbox you want to install a random collection of 100 applications useful for ‘scientific researchers’. To me this is a symptom of a system that does not provide a good way of finding and installing new applications. Thanks to the ardent efforts of Richard Hughes we have a new Software installer that keeps going from strength to strength. It was originally launched in Fedora 19, but as we move towards the first Fedora Workstation release we are enabling new features and adding polish to it. One area where we need the wider Fedora community to work with us is increasing the coverage of appdata files. Appdata files essentially contain the metadata the installer needs to describe and advertise the application in question, including descriptive text and screenshots. Ideally upstreams should ship their own appdata file, but where they do not, we should add one to the Fedora package directly. Currently applications from the GTK+ and GNOME sphere have relatively decent appdata coverage, but we need more effort to get applications using other toolkits covered too.
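To give a rough idea of what such a file involves: an appdata file is a small XML document shipped alongside the application, usually in /usr/share/appdata/. A minimal sketch might look something like the following (the application id, name, descriptions and screenshot URL here are made up for illustration, and the exact set of tags should be checked against the current AppData specification):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical example: /usr/share/appdata/myapp.appdata.xml -->
<application>
  <!-- matches the application's desktop file name -->
  <id type="desktop">myapp.desktop</id>
  <!-- license of this metadata file, not of the application itself -->
  <metadata_license>CC0-1.0</metadata_license>
  <name>MyApp</name>
  <summary>A one-line description shown in the installer</summary>
  <description>
    <p>A longer description of what the application does, shown on
    the application's page in the Software installer.</p>
  </description>
  <url type="homepage">http://example.org/myapp</url>
  <screenshots>
    <screenshot type="default">http://example.org/myapp-screenshot.png</screenshot>
  </screenshots>
</application>
```

Writing one of these for an application you package or maintain is a small, self-contained way to contribute to the effort.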

Which brings me to another item of importance to the workstation. The Linux community has, for natural reasons, been very technical in nature, which has meant that some things that are not even a question on other operating systems have become defining traits on Linux. The choice of GUI development toolkit is one of these, and it has been a great tool used by the open source community to shoot ourselves in the foot for many years now. While users of Windows or Mac OS X probably never ask themselves what toolkit was used to implement a given application, it seems to be a frequently asked question about Linux applications. We want to move away from that with the Workstation. While we do ship the GNOME Shell as our interface and use GTK+ for developing tools ourselves, including spending time evolving the toolkit itself, that does not mean we think applications written using, for instance, Qt, EFL or Java are evil and should be exorcised from the system. In fact, if an application developer wants to write an application for the Linux desktop at all, we greatly appreciate that effort regardless of what tools they decide to use. The choice of development toolkit is a choice meant to empower developers, not create meaningless distinctions for the end user. So one effort we have underway is the theming and other glue code needed to make sure that if you run a Qt application under the GNOME Shell it feels like it belongs there, which also extends to accessibility-related setups like the high contrast theme. We hope to expand upon that effort both in width and in depth going forward.

And on a somewhat related note, we are also trying to address the elephant in the room when it comes to the desktop: the fact that the importance of the traditional desktop is decreasing in favor of the web. A lot of things that you used to do locally on your computer you are probably just doing online these days, and a lot of the new things you have started doing on your computer or other internet capable device are actually web services as opposed to local applications. The old Sun slogan of ‘The Network is the Computer’ is more true today than it has ever been. We don’t believe the desktop is dead in any shape or form, as some of the hipsters in the media like to claim; in fact we expect it to stay around for a long time. What we do envision, though, is that the amount of time you spend on webapps will continue to grow and that more and more of your computing tasks will be done using web services as opposed to local applications. Which is why we are continuing to deeply integrate the web into your desktop, be that through things like GNOME Online Accounts or the new webapps introduced in the Software installer. And as I have mentioned before on this blog, we are also still working to improve the integration of Chrome and Firefox apps into the desktop along the same lines. So while we want the desktop to help you use the applications you run locally as efficiently as possible, we also realize that you, like us, are living in a connected world, and thus we need to give you easy access to your online life to stay relevant.

There are of course many other parts of the Fedora Workstation effort, but this has already turned into a very long blog post as it is, so I will leave the rest for later. Please feel free to post any questions or comments and I will try to respond.