planet.freedesktop.org
September 16, 2014
We've added a few small but nonetheless interesting features to Videos in GNOME 3.14.

Auto-rotation of videos

If you capture videos in portrait orientation on your phone, we are now able to rotate them automatically in the movie player, as well as in the thumbnails.

Better streaming

You can now seek anywhere inside streamed videos, even if we didn't download all the way to that point. That's particularly useful for long videos, or slow servers (or a combination of both).

Thumbnail generation

Finally, videos without thumbnails in your videos directory will have thumbnails automatically generated, without having to browse them in Files. This makes the first experience of videos more pleasing to the eye.

What's next?

We'll work on integrating Victor Toso's work on grilo plugins, to show information about the films or TV series on your computer, such as grouping episodes of a series together, and showing genres, covers and synopses for films.

With a bit of luck, we should also be able to provide you with more video content as well, through partners.
September 15, 2014

A Forest of X Server Changes

We’ve got about another month left in the X server merge window for 1.17 and I’ve written a small set of fixes which haven’t been reviewed yet for merging. I thought I’d advertise them a bit and see if I couldn’t encourage a few of you to take a look and see if they’re useful, correct and complete.

All of these are in my personal X server repository:

git://people.freedesktop.org/~keithp/xserver.git

Cleaning up the X Registry

Branch: registry-fixes

I’ll bet most of you don’t even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the requests, events and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, “protocol.txt”, which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what’s wrong with this? Three basic things:

  1. A simple bug — protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren’t available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn’t in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.

Getting rid of the TCP listener by default

Branch: listen-fixes

We’ve had the ‘-nolisten’ option for a while now to disable inbound TCP connections. It’s useful for security reasons, but we’ve never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, ‘-listen’, is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.

Glamor bug fixes

branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we’ll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We’ll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED.

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.

Still left to do is

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn’t. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn’t handle, and that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

A lot of people read up on good Python practice, and there's plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker's Guide to Python. Today I'd like to show a concrete case of code that I don't consider being the state of the art.

In my last article where I talked about my new project Gnocchi, I wrote about how I tested, hacked and then ditched whisper. Here I'm going to explain part of my thought process and a few things that raised my eyebrows when hacking on this code.

Before I start, please don't get the spirit of this article wrong. It's in no way a personal attack on the authors and contributors (whom I don't know). Furthermore, whisper is a piece of code that is in production in thousands of installations, storing metrics for years. While I argue that the code does not follow best practices, it definitely works well enough and is valuable to a lot of people.

Tests

The first thing that I noticed when trying to hack on whisper is the lack of tests. There's only one file containing tests, named test_whisper.py, and the coverage it provides is pretty low. One can check that using the coverage tool.

$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s
 
OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%


While one would think that 61% is "not so bad", taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check whether the results can be fetched and whether the fetched results are accurate. This is a good reason one should never blindly trust the test coverage percentage as a quality metric.

When I tried to modify whisper, because the tests do not check the entire cycle of the values fed into the database, I ended up making incorrect changes while the tests still passed.
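
For instance, a round-trip test that stores a point and then checks that it can be read back would catch that class of regression. The sketch below is based on my understanding of the whisper API (create(), update(), fetch()) and may need adjusting:

import os
import tempfile
import time
import unittest

import whisper


class TestRoundTrip(unittest.TestCase):
    def test_update_then_fetch(self):
        # A single archive: 1-second resolution, 60 points.
        path = os.path.join(tempfile.mkdtemp(), "roundtrip.wsp")
        whisper.create(path, [(1, 60)])

        now = int(time.time())
        whisper.update(path, 42.0, timestamp=now)

        # Fetch the range back and check the stored value actually comes out.
        (start, end, step), values = whisper.fetch(path, now - 10, now + 1)
        self.assertIn(42.0, values)


if __name__ == "__main__":
    unittest.main()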

No PEP 8, no Python 3

The code doesn't respect PEP 8. A run of flake8 + hacking shows 732 errors… While this does not impact the behaviour of the code itself, it makes it more painful to hack on than most Python projects.

The hacking tool also shows that the code is not Python 3 ready as there is usage of Python 2 only syntax.

A good way to fix that would be to set up tox and add a few targets for PEP 8 checks and Python 3 tests, as sketched below. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 would put the project in a better light.
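
A minimal tox.ini along those lines could look like the following sketch; the Python versions and commands are my own guesses, not something the project ships:

# Hypothetical tox.ini: run the existing unit tests on Python 2.7 and 3.4,
# plus a pep8 environment running flake8 with the hacking plugin.
[tox]
envlist = py27,py34,pep8

[testenv]
commands = python test_whisper.py

[testenv:pep8]
deps =
    flake8
    hacking
commands = flake8 whisper.py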

Not using idiomatic Python

A lot of the code could be simplified by using idiomatic Python. Let's take a simple example:

def fetch(path,fromTime,untilTime=None,now=None):
    fh = None
    try:
        fh = open(path,'rb')
        return file_fetch(fh, fromTime, untilTime, now)
    finally:
        if fh:
            fh.close()


That piece of code could be easily rewritten as:

def fetch(path,fromTime,untilTime=None,now=None):
    with open(path, 'rb') as fh:
        return file_fetch(fh, fromTime, untilTime, now)


This way, the function actually looks so simple that one can even wonder why it should exist – but why not.

Usage of loops could also be made more Pythonic:

for i,archive in enumerate(archiveList):
    if i == len(archiveList) - 1:
        break


could be actually:

for i, archive in enumerate(itertools.islice(archiveList, len(archiveList) - 1)):


That reduces the code size and makes it easier to read through the code.

Wrong abstraction level

Also, one thing that I noticed in whisper is that it abstracts its features at the wrong level.

Take the create() function; what it does is pretty obvious:

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
    # Set default params
    if xFilesFactor is None:
        xFilesFactor = 0.5
    if aggregationMethod is None:
        aggregationMethod = 'average'

    #Validate archive configurations...
    validateArchiveList(archiveList)

    #Looks good, now we create the file and write the header
    if os.path.exists(path):
        raise InvalidConfiguration("File %s already exists!" % path)
    fh = None
    try:
        fh = open(path,'wb')
        if LOCK:
            fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

        aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
        oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
        maxRetention = struct.pack( longFormat, oldest )
        xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
        archiveCount = struct.pack(longFormat, len(archiveList))
        packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
        fh.write(packedMetadata)
        headerSize = metadataSize + (archiveInfoSize * len(archiveList))
        archiveOffsetPointer = headerSize

        for secondsPerPoint,points in archiveList:
            archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
            fh.write(archiveInfo)
            archiveOffsetPointer += (points * pointSize)

        #If configured to use fallocate and capable of fallocate use that, else
        #attempt sparse if configure or zero pre-allocate if sparse isn't configured.
        if CAN_FALLOCATE and useFallocate:
            remaining = archiveOffsetPointer - headerSize
            fallocate(fh, headerSize, remaining)
        elif sparse:
            fh.seek(archiveOffsetPointer - 1)
            fh.write('\x00')
        else:
            remaining = archiveOffsetPointer - headerSize
            chunksize = 16384
            zeroes = '\x00' * chunksize
            while remaining > chunksize:
                fh.write(zeroes)
                remaining -= chunksize
            fh.write(zeroes[:remaining])

        if AUTOFLUSH:
            fh.flush()
            os.fsync(fh.fileno())
    finally:
        if fh:
            fh.close()


The function is doing everything: checking if the file doesn't exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.

That means that the caller has to provide a file path, even if it just wants a whisper data structure that it will store elsewhere itself. StringIO() could be used to fake a file handler, but it will fail if the call to fcntl.flock() is not disabled – and it is inefficient anyway.

There are a lot of other functions in the code, such as setAggregationMethod(), that mix file handling – even doing things like os.fsync() – with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out that reusing these functions in a different context is nearly impossible.
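
As an illustration of a different split (my own sketch, not whisper's actual API), the header packing could be kept separate from any file I/O, so a caller that only wants the bytes never has to provide a path or worry about locking:

import struct

# Assumed to mirror whisper's longFormat/floatFormat packing.
LONG_FORMAT = "!L"
FLOAT_FORMAT = "!f"


def pack_metadata(archive_list, x_files_factor=0.5, aggregation_type=1):
    """Build the packed header for a list of (seconds_per_point, points) archives."""
    max_retention = max(seconds_per_point * points
                        for seconds_per_point, points in archive_list)
    return (struct.pack(LONG_FORMAT, aggregation_type)
            + struct.pack(LONG_FORMAT, max_retention)
            + struct.pack(FLOAT_FORMAT, x_files_factor)
            + struct.pack(LONG_FORMAT, len(archive_list)))


def write_header(fh, archive_list, **kwargs):
    """Thin I/O wrapper: works with any file-like object, including io.BytesIO."""
    fh.write(pack_metadata(archive_list, **kwargs))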

Race conditions

There are race conditions, for example in create() (see added comment):

if os.path.exists(path):
    raise InvalidConfiguration("File %s already exists!" % path)
fh = None
try:
    # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
    # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
    fh = open(path,'wb')


That code should be:

try:
    fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), 'wb')
except OSError as e:
    if e.errno == errno.EEXIST:
        raise InvalidConfiguration("File %s already exists!" % path)
    raise  # propagate unexpected errors instead of swallowing them


to avoid any race condition.

Unwanted optimization

We saw earlier that the fetch() function is barely useful, so let's take a look at the file_fetch() function that it calls.

def file_fetch(fh, fromTime, untilTime, now = None):
    header = __readHeader(fh)
    [...]


The first thing the function does is to read the header from the file handler. Let's take a look at that function:

def __readHeader(fh):
    info = __headerCache.get(fh.name)
    if info:
        return info

    originalOffset = fh.tell()
    fh.seek(0)
    packedMetadata = fh.read(metadataSize)

    try:
        (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
    except:
        raise CorruptWhisperFile("Unable to read header", fh.name)
    [...]


The first thing the function does is to look into a cache. Why is there a cache?

It actually caches the header, with an index based on the file path (fh.name). Except that if one decides, for example, not to use a file and to cheat using StringIO, then that object does not have any name attribute, and this code path will raise an AttributeError.

One has to set a fake name manually on the StringIO instance, and it must be unique so nobody messes with the cache:

import StringIO
 
packedMetadata = <some source>
fh = StringIO.StringIO(packedMetadata)
fh.name = "myfakename"
header = __readHeader(fh)


The cache may actually be useful when accessing files, but it's definitely useless when not using files, and it's not clear that the complexity the cache adds (even if small) is worth it. I doubt most whisper-based tools are long-running processes, so the cache that really matters when accessing the files is the one handled by the operating system kernel, and that one is going to be much more efficient anyway, and shared between processes. There's also no expiry of that cache, which could end up wasting a lot of memory.
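
If a header cache really is wanted, a bounded, path-keyed one would at least address the expiry problem. Here is a rough sketch on Python 3 using functools.lru_cache; the metadata format constant is an assumption on my part, not whisper's actual code:

import functools
import struct

METADATA_FORMAT = "!2LfL"  # assumed equivalent of whisper's metadataFormat
METADATA_SIZE = struct.calcsize(METADATA_FORMAT)


@functools.lru_cache(maxsize=256)
def read_header_cached(path):
    """Read and unpack the header of a whisper file, keeping at most 256 entries cached."""
    with open(path, 'rb') as fh:
        packed = fh.read(METADATA_SIZE)
    return struct.unpack(METADATA_FORMAT, packed)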

Docstrings

None of the docstrings are written in a parsable syntax like Sphinx's. This means you cannot generate documentation in a nice format that a developer using the library could read easily.

The documentation is also not up to date:

def fetch(path,fromTime,untilTime=None,now=None):
    """fetch(path,fromTime,untilTime=None)
    [...]
    """

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
    """create(path,archiveList,xFilesFactor=0.5,aggregationMethod='average')
    [...]
    """


This is something that could be avoided if a proper format were picked for the docstrings. A tool could then be used to notice when there's a divergence between the actual function signature and the documented one, such as a missing argument.
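
For example, a Sphinx-style docstring for fetch() could look like this; the parameter descriptions are my own reading of the code, not official documentation:

def fetch(path, fromTime, untilTime=None, now=None):
    """Fetch data points from a whisper file.

    :param path: path of the whisper file on disk
    :param fromTime: epoch timestamp marking the start of the requested range
    :param untilTime: epoch timestamp marking the end of the range,
        defaulting to the current time
    :param now: epoch timestamp to treat as "now", mainly useful for tests
    :returns: a ((fromTime, untilTime, step), values) tuple
    """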

Duplicated code

Last but not least, there's a lot of code duplicated across the scripts provided by whisper in its bin directory. These scripts should be very lightweight and use the console_scripts facility of setuptools, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the whisper.py library, which violates DRY.
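
With console_scripts, each script shrinks to an entry point declared in setup.py plus a small, importable (and therefore testable) function. A hypothetical sketch, with made-up module and function names:

# setup.py (excerpt) -- hypothetical, not what whisper currently ships.
from setuptools import setup

setup(
    name="whisper",
    py_modules=["whisper"],
    entry_points={
        "console_scripts": [
            # "command-name = module:function"
            "whisper-create = whisper_cli:create_main",
            "whisper-fetch = whisper_cli:fetch_main",
        ],
    },
)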

Conclusion

There are a few more things that made me stop considering whisper, but these are part of the whisper features, not necessarily code quality. One can also point out that the code is very condensed and hard to read, and that's a more general problem about how it is organized and abstracted.

A lot of these defects are actually points that made me start writing The Hacker's Guide to Python a year ago. Running into this kind of code makes me think it was a really good idea to write a book on advice to write better Python code!

A book I wrote talking about designing Python applications, state of the art, advice to apply when building your application, various Python tips, etc. Interested? Check it out.

September 14, 2014

And not the Kingdom of Spain, unfortunately (unfortunately because I miss it and because it's still a kingdom). In a few months (not sure about specific dates yet, probably in early 2015) I will be moving back to the United Kingdom, this time to the larger metropolis, London. Don't panic, I will still be with Red Hat; there won't be a lot of changes on that front. In the meantime I will settle back in Gran Canaria and will be flying back and forth on a monthly basis.

I must note that when I made the decision to move to the Czech Republic my plan was: "I do not have a plan", just enjoying it and trying to make the best of it without thinking about deadlines for moving back to Spain. Red Hat has been a very welcoming company in which I feel just like home, Brno has been a very welcoming city, and this is definitely a part of Europe that is worth experiencing. I've met terrific people during this period both inside and outside Red Hat.

There was, however, a little problem.

Something altered the mid-term plans. A few months before I moved, when the decision was already made, I met someone very special with whom I now want to share my life. After 16 months of a long-distance relationship it was due time to find a place where we could be together, and after months of planning and considering options, London presented itself as the spot to make the move, as she found a pretty good job there.

While I am going to miss sharing the office on a daily basis with awesome people, I am looking forward to this new chapter in my life.

Canary Wharf at Night | London, England, Niko Trinkhaus, (CC by-nc)

I want to note that I am deeply thankful to Christian Schaller for his tremendous amount of support during my stay in Brno and for working with me in figuring ways to balance my professional and personal life. I also wish him the best of luck with his new life in Westford, I'm certainly going to miss him.

On the other hand I guess this means I'll show up at the GNOME Beers in London more often :-)

September 11, 2014

It is time for another report on Listaller, the cross-distro 3rd-party package installer, which has now been in development for – depending on how you count – 5-6 years. This will become a longer post, so you might grab some coffee or tea ;-)

The original idea

The Listaller project was initially started with the goal to make application deployment on Linux distributions as simple as possible, by providing a unified package installation format and tools which make building apps for multiple distributions easier and deployment of updates simple. The key ideas were:

  • Seamless integration of all installation steps into the system – users shouldn’t care about the origin of their application, they just handle all installed apps with the same tool and update all apps with the same interface they use for updating the system.
  • Out-of-the-box sandboxing for all 3rd-party apps
  • Easy signing and key-validation for Listaller packages
  • Simple creation of updates for developers
  • Resource-sharing: It should always be clear which application uses which library, duplicates should be avoided. The distribution-provided software should take priority, since it is often well-maintained and receives security updates.

The current state

The current release of Listaller handles all of this with a plugin for PackageKit, the cross-distro package-management abstraction layer. It hooks into PackageKit and reads information passing through to the native distributor backend, and if it encounters Listaller software, it handles it appropriately. It can also inject update information. This results in all Listaller software being shown in any PackageKit frontends, and people can work with it just like if the packages were native packages. Listaller package installations are controlled by a machine policy, so the administrator can decide that e.g. only packages from a trusted source (= GPG signature in trusted database) can be installed. Dependencies can be pulled from the distributor’s repositories, or optionally from external sources, like the PyPI.

This sounds good on paper, but the current implementation has various problems.

The issues

The current Listaller approach has some problems. The biggest one lies in the future: Soon, there will be no PackageKit plugins anymore! PackageKit 1.0 will remove support for them, because they appear to be a major source of crashes; even the in-tree plugins cause problems. Also, the PackageKit service itself is currently being trimmed of unneeded features and less-used code. These changes in PackageKit are great and needed for the project (and I support these efforts), but they cause a pretty huge problem for Listaller: The project relies on the PackageKit plugin – if used without it, you lose the system-integration part, which is one of the key concepts of Listaller, and a primary goal.

But this issue is not the only one. There are more. One huge problem for Listaller is dependency-solving: It needs to know where to get software from in case it isn't installed already. And that has to be done in a cross-distributional way. This is an incredibly complex task, and Listaller contains lots of workarounds for various quirks. It contains so many hacks for distro-specific stuff that it became really hard to understand. The Listaller dependency model also became very complex, because it tried to handle many corner-cases. This is bad, of course. But the workarounds weren't added for fun; they were added because it was assumed to be easier than fixing the root cause, which would have required collaboration between distributors and some changes to the stack, which seemed unlikely to happen at the time the code was written.

The systemd effort

Another thing which affects Listaller is the latest push from the systemd team to allow cross-distro 3rd-party installations to happen. I definitely recommend reading the linked blogpost from Lennart if you have some spare time! The identified problems are the same as for Listaller, but the solution they propose is completely different, and about three orders of magnitude more invasive than whatever the Listaller project had in mind (I made these numbers up, so don't ask!). There are also a few issues I see with Lennart's approach; I will probably go into detail about them in another blogpost (e.g. it requires multiple copies of a library lying around, where one version might have a security vulnerability and another one doesn't – it's hard to ensure everything is up to date and secure that way, even if you have a top-notch sandbox). I have great respect for the systemd crew and especially Lennart, and I hope they succeed with their efforts. However, I also think Listaller can achieve similar things with a less invasive solution, at least for 3rd-party app installations (Listaller is one of the partial-fix solutions with a strict focus, so not a direct competitor to the holistic systemd approach; both solutions could happily live together).

A step into the future

Some might have guessed it already: There are some bigger changes coming to Listaller! The most important one is that there will be no Listaller anymore, at least not in its old form.

Since the current code relies heavily on the PackageKit plugin, and contains some ugly workarounds, it doesn’t make much sense to continue working on it.

Instead, I started the Listaller.NEXT project, which is a rewrite of Listaller in C. There are some goals for the rewrite:

  • No stupid hacks and workarounds: We will not add any workaround. If there is a problem, we will fix it at its source, even if that might be more invasive.
  • Trimmed down project: The new incarnation of Listaller will only support installations of statically linked software at the beginning. We will start with a very small, robust core, and then add more features (like dependency-solving) gradually, but only if they are useful. There will be no feature-creep like in the previous version.
  • Faster development cycle: Releases will happen much faster, not only two or three times a year
  • Integration: Since there is no PackageKit plugin anymore, but integration is still one of Listaller’s key concepts, we will integrate Listaller into downstream tools, ranging from Apper to GNOME-Software. Richard Hughes will help with the integration and user interfaces, so Listaller applications get displayed properly.
  • AppStream-first: AppStream is the ultimate tool for Listaller to detect dependencies. With the 0.6 release, the Listaller component-concept was merged into it, which makes it a very powerful and non-hackish solution for dependency-detection. We will advance the use of its metadata, and probably use it exclusively, which would restrict Listaller to only work properly on distributions which ship AppStream metadata.
  • No desktop-only focus: The previous Listaller was focused only on desktop GUI apps. The new version will be developed with a much larger target audience in mind, including server deployments (“Can I use it to deploy my server app?” is one of the most frequently asked questions about Listaller – with the new version, the answer is yes)
  • We will continue to improve the static-linking and cross-distro development toolchain (libuild, with ligcc, lig++ and binreloc), to make building portable apps easier.

I made a last release of the 0.5.x series of Listaller, to work with PackageKit 0.9.x – the future lies in the C port.

If you are using Listaller (and I know of people who do, for example some deploy statically-linked stuff on internal test-setups with it), stay tuned. The packaging format will stay mostly compatible with the current version, so you will not see many changes there (the plan is to freeze it very soon, so no backwards-incompatible changes are made anymore). The 0.5.x series will receive critical bugfixes if necessary.

Help needed!

As always, there is help needed! Writing C is not that difficult ;-) But user feedback is welcome as well, in case you have an idea. The new code will be hosted on Github in the new listaller-next branch (currently not that much to find there). Long-term, we will completely migrate away from Launchpad.

You can expect more blogposts about the Listaller concepts and progress in the next months (as soon as I am done with some AppStream-related things, which take priority).

September 03, 2014
(Embedded image: thereifixedit.com, "Euro Ipod Charger")

I try to fairly regularly build recent git checkouts of all the upstream modules from X.Org (at least all those listed in the current build.sh) on Solaris. Normally I do this in 32-bit mode on x86 machines using the Sun compilers on the latest Solaris 11 internal development build, but I also occasionally do it in 64-bit mode, or with gcc compilers, or on a SPARC machine. This helps me catch issues that would break our builds when we integrate the new releases before those releases happen. (Ideally I'd set up a Solaris client of the X.Org tinderbox, but I've not gotten around to that.)

Anyways, recently I finally decided to track down an error that only shows up in the 64-bit builds of the xscope protocol monitor/decoder for X11 on Solaris. The builds run fine up until the final link stage, which fails with:

ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol littleEndian: value 0x8086c355 does not fit
ld: fatal: relocation error: R_AMD64_PC32: file audio.o: symbol ServerHostName: value 0x8086b4fe does not fit
ld: fatal: relocation error: R_AMD64_PC32: file decode11.o: symbol LBXEvent: value 0x808664c3 does not fit
(and over 150 more symbols that didn't fit)

A google search turned up some forum posts, a blog post, and an article on the AMD64 ABI support in the Sun Studio compilers. And indeed, the solutions they offered did work - building with -Kpic did allow the program to link.

But is that really the best answer? xscope is a simple program, and shouldn't be overflowing the normal memory model. Once it linked, looking at the resulting binary was a bit shocking:

% /usr/gnu/bin/size  xscope
   text	   data	    bss	    dec	    hex	filename
 416753	   5256	2155921980	2156343989	808732b5	xscope

% /usr/bin/size -f xscope

23(.interp) + 32(.SUNW_cap) + 5860(.eh_frame_hdr) + 27200(.eh_frame)
 + 2964(.SUNW_syminfo) + 5944(.hash) + 4224(.SUNW_ldynsym)
 + 17784(.dynsym) + 14703(.dynstr) + 192(.SUNW_version)
 + 1482(.SUNW_versym) + 3168(.SUNW_dynsymsort) + 96(.SUNW_reloc)
 + 1944(.rela.plt) + 1312(.plt) + 291018(.text) + 33(.init) + 33(.fini)
 + 280(.rodata) + 38461(.rodata1) + 1376(.got) + 784(.dynamic)
 + 1952(.data) + 0(.bssf) + 1144(.picdata) + 0(.tdata) + 0(.tbss)
 + 2155921980(.bss) = 2156343989

% pmap -x `pgrep xscope`
26151:	./xscope
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        408        408          -          - r-x--  xscope
0000000000476000          8          8          8          - rw---  xscope
0000000000478000    2105388       1064       1064          - rw---  xscope
0000000080C83000         52         52         52          - rw---    [ heap ]
[....]
FFFFFD7FFFDF8000         32         32         32          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb    2108668       3204       1300          -

Two gigabytes of .bss space allocated!?!?! That can't be right. Looking through the output of the elfdump and nm programs a single symbol stood out:

Symbol Table Section:  .SUNW_ldynsym
     index    value              size              type bind oth ver shndx          name
[...]
      [89]  0x00000000009ff280 0x0000000080280000  OBJT GLOB  D    1 .bss           FDinfo

[Index]   Value                Size                Type  Bind  Other Shndx   Name
[...]
[528]   |            10482304|          2150105088|OBJT |GLOB |0    |28     |FDinfo

Unfortunately, that wasn't one of the ones listed in the linker errors, since its starting address fit inside the normal memory model, but everything that came after it was out of range.

So what is this giant static allocation for? It's defined in scope.h:

#define BUFFER_SIZE (1024 * 32)

struct fdinfo
{
  Boolean Server;
  long    ClientNumber;
  FD      pair;
  unsigned char   buffer[BUFFER_SIZE];
  int     bufcount;
  int     bufstart;
  int     buflimit;     /* limited writes */
  int     bufdelivered; /* total bytes delivered */
  Boolean writeblocked;
};

extern struct fdinfo   FDinfo[StaticMaxFD];

So it allocates a 32k buffer for up to StaticMaxFD file descriptors. How many is that? For that we need to look in xscope's fd.h:

/* need to change the MaxFD to allow larger number of fd's */
#define StaticMaxFD FD_SETSIZE

and from there to the Solaris system headers, which define FD_SETSIZE in <sys/select.h>:

/*
 * Select uses bit masks of file descriptors in longs.
 * These macros manipulate such bit fields.
 * FD_SETSIZE may be defined by the user, but the default here
 * should be >= NOFILE (param.h).
 */
#ifndef FD_SETSIZE
#ifdef _LP64
#define FD_SETSIZE      65536
#else
#define FD_SETSIZE      1024
#endif  /* _LP64 */

So this makes the buffer fields alone in FDinfo become 65536 * 32 * 1024 bytes, aka 2 gigabytes.

Thus in this case, while compiler flags like -Kpic allow the code to link, using -DFD_SETSIZE=256 instead builds code that's a little bit saner, fits in the normal memory model, and is less likely to fail with out-of-memory errors when you need it most:

% /usr/gnu/bin/size -f xscope
   text	   data	    bss	    dec	    hex	filename
 409388	   3352	8449804	8862544	 873b50	xscope

% pmap -x `pgrep xscope`
         Address     Kbytes        RSS       Anon     Locked Mode   Mapped File
0000000000400000        404        404          -          - r-x--  xscope
0000000000475000          4          4          4          - rw---  xscope
0000000000476000       8248         20         20          - rw---  xscope
0000000000C84000         52         52         52          - rw---    [ heap ]
[...]
FFFFFD7FFFDFD000         12         12         12          - rw---    [ stack ]
---------------- ---------- ---------- ---------- ----------
        total Kb      11500       2136        232          -

Of course that assumes that xscope is not going to be monitoring more than about 120 clients at a time (since it opens two file descriptors for each client, one connected to the client and one to the real X server), and still wastes many page mappings if you're only monitoring one client. The real fix being worked on for the next upstream release is to make the buffer allocation be dynamic, and allocate just enough for the number of clients we actually are monitoring.

The moral of this story? Just because you can make it build doesn't mean you've fixed it well, and sometimes it's useful to understand why the linker is giving you a hard time.

September 02, 2014
I've had a couple of questions about whether there's a way for others to contribute to the VC4 driver project.  There is!  I haven't posted about it before because things aren't as ready as I'd like for others to do development (it has a tendency to lock up, and the X implementation isn't really ready yet so you don't get to see your results), but that shouldn't actually stop anyone.

To get your environment set up, build the kernel (https://github.com/anholt/linux.git vc4 branch), Mesa (git://anongit.freedesktop.org/mesa/mesa) with --with-gallium-drivers=vc4, and piglit (git://anongit.freedesktop.org/git/piglit).  For working on the Pi, I highly recommend having a serial cable and doing NFS root so that you don't have to write things to slow, unreliable SD cards.

You can run an existing piglit test that should work, to check your environment: env PIGLIT_PLATFORM=gbm VC4_DEBUG=qir ./bin/shader_runner tests/shaders/glsl-algebraic-add-add-1.shader_test -auto -fbo -- you should see a dump of the IR for this shader, and a pass report.  The kernel will make some noise about how it's rendered a frame.

Now the actual work:  I've left some of the TGSI opcodes unfinished (SCS, DST, DPH, and XPD, for example), so the driver just aborts when a shader tries to use them.  How they work is described in src/gallium/docs/source/tgsi.rst. The TGSI-to-QIR code is in vc4_program.c (where you'll find all the opcodes that are implemented currently), and vc4_qir.h has all the opcodes that are available to you and helpers for generating them.  Once it's in QIR (which I think should have all the opcodes you need for this work), vc4_qpu_emit.c will turn the QIR into actual QPU code like you find described in the chip specs.

You can dump the shaders being generated by the driver using VC4_DEBUG=tgsi,qir,qpu in the environment (that gets you 3/4 stages of code dumped -- at times you might want some subset of that just to quiet things down).

Since we've still got a lot of GPU hangs, and I don't have reset working, you can't even complete a piglit run to find all the problems or to test whether your changes are good.  What I can offer currently is that you could run PIGLIT_PLATFORM=gbm VC4_DEBUG=norast ./piglit-run.py tests/quick.py results/vc4-norast; piglit-summary-html.py --overwrite summary/mysum results/vc4-norast will get you a list of all the tests (which mostly failed, since we didn't render anything), some of which will have assertion failures.  Now that you have which tests were assertion failing from the opcode you worked on, you can run them manually, like PIGLIT_PLATFORM=gbm /home/anholt/src/piglit/bin/shader_runner /home/anholt/src/piglit/generated_tests/spec/glsl-1.10/execution/built-in-functions/vs-asin-vec4.shader_test -auto (copy-and-pasted from the results) or PIGLIT_PLATFORM=gbm PIGLIT_TEST="XPD test 2 (same src and dst arg)" ./bin/glean -o -v -v -v -t +vertProg1 --quick (also copy-and-pasted from the results, but note that you need the other env var for glean to pick out the subtest to run).

Other things you might want eventually: I do my development using cross-builds instead of on the Pi, install to a prefix in my homedir, then rsync that into my NFS root and use LD_LIBRARY_PATH/LIBGL_DRIVERS_PATH on the Pi to point my tests at the driver in the homedir prefix.  Cross-builds were a *huge* pain to set up (debian's multiarch doesn't ship the .so symlink with the library, and the -dev packages that do install them don't install simultaneously for multiple arches), but it's worth it in the end.  If you look into cross-building, what I'm using is rpi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-gcc and you'll want --enable-malloc0returnsnull if you cross-build a bunch of X-related packages.
September 01, 2014

So two years ago my family and I moved to Brno in the Czech Republic due to me starting a new job at Red Hat. It has been two roller coaster years with a lot of changes happening both inside Red Hat and in the world that the Linux desktop operates in. During those years my wife and I have come to love Brno, which both of us find a bit surprising as we were both quite skeptical of the city at the outset.

I think, having grown up in Western Europe during the Cold War, I had some preconceptions about what life was like in the former Eastern Europe. Brno specifically struggles a bit with being the second city in the Czech Republic after Prague, due to Prague so often being hailed internationally as a beautiful and exciting city.

But I think during these two years Brno has proven itself to us as a great place to live, especially if you have a little child. Brno has a lot of beautiful outdoor areas which are great for hiking or relaxing, it is packed full of children's cafes where you can take your kid to play while you sit down and have a coffee or a tea, it has a vibrant expat community, affordable housing, a good range of restaurants, and it is a short distance from major cities like Vienna, Prague and Budapest. There are also lots of old castles and towns to explore in the vicinity; I think Telc has to be one of our topmost favorites in that regard. And it has very little crime: my wife has been telling her friends that Brno is the first city she has ever lived in where she feels that, as a woman, she can walk through the city in the evening or at night and feel safe.

But that said the time has come for us to move on. Due to one of these changes inside Red Hat I mentioned I am getting moved to our US Engineering office in Westford, Massachusetts. For those not familiar with Westford it is close to a city you probably do know, Boston.

So tomorrow the moving company will arrive at our flat here in Brno and pack up everything for the transport to the US. The furniture will take some time to arrive there, so while our stuff is sailing across the ocean we will live with my family in Norway, while I take advantage of the Red Hat office in downtown Oslo. So by mid-October I expect us to be fully set up in the Boston area, although we are heading over there next week for a final house hunting trip so that the furniture has a place to arrive to :)

So goodbye to Brno for now, and looking forward to seeing new and old friends in Boston!

August 31, 2014

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems, I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager then grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of timely fixing security problems and helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles than the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.

  • Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.

  • The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where x86-64 libraries are put is different on Fedora and Debian derived systems.

  • Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.

This all together makes it really hard for many upstreams to work nicely with the current way how Linux works. Often they try to improve the situation for them, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images, that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there are no ways how to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages as well as package managers as end-user install and update tool is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about it. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we built/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix these problems in a generic way, for all uses, right in the core components of the OS itself.

Linux has come to tremendous successes because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme how to solve the problem set described above, that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their software (regardless if just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.

  • We want to allow end users and administrators to install these packages on their systems, regardless which distribution they have installed on it.

  • We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered, (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around the variety of concepts of btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> -- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The <vendorid> field is replaced by some vendor identifier, maybe a scheme like org.fedoraproject.FedoraWorkstation. The <architecture> field specifies a CPU architecture the OS is designed for, for example x86-64. The <version> field specifies a specific OS version, for example 23.4. An example sub-volume name could hence look like this: usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> -- This refers to an instance of an operating system. It's basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The <name> field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> -- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the usr sub-volumes explained above, however, while a usr sub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> -- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> -- This encapsulates an application bundle. It contains a tree that at runtime is mounted to /opt/<vendorid>, and contains all the application's resources. The <vendorid> could be a string like org.libreoffice.LibreOffice, the <runtime> refers to one the vendor id of one specific runtime the application is built for, for example org.gnome.GNOME3_20:3.20.1. The <architecture> and <version> refer to the architecture the application is built for, and of course its version. Example: app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> -- This sub-volume shall refer to the home directory of the specific user. The <user> field contains the user name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.
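
To make the naming scheme concrete, here is a small illustrative parser (my own sketch, not part of the proposal) that splits such a sub-volume name into its typed fields:

def parse_subvolume_name(name):
    """Split a sub-volume name such as 'usr:<vendorid>:<architecture>:<version>' into fields."""
    fields = {
        "usr": ("vendorid", "architecture", "version"),
        "root": ("name", "vendorid", "architecture"),
        "runtime": ("vendorid", "architecture", "version"),
        "framework": ("vendorid", "architecture", "version"),
        "app": ("vendorid", "runtime", "architecture", "version"),
        "home": ("user", "uid", "gid"),
    }
    kind, rest = name.split(":", 1)
    parts = rest.split(":")
    if kind not in fields or len(parts) != len(fields[kind]):
        raise ValueError("not a recognized sub-volume name: %r" % name)
    return kind, dict(zip(fields[kind], parts))


# Example:
# parse_subvolume_name("usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4")
# -> ("usr", {"vendorid": "org.fedoraproject.FedoraWorkstation",
#             "architecture": "x86_64", "version": "23.4"})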

How To Use It

After we introduced this naming scheme let's see what we can build of this:

  • When booting up a system we mount the root directory from one of the root sub-volumes, and then mount /usr from a matching usr sub-volume. Matching here means it carries the same <vendor-id> and <architecture>. Of course, by default we should pick the matching usr sub-volume with the newest version by default.

  • When we boot up an OS container, we do exactly the same as the when we boot up a regular system: we simply combine a usr sub-volume with a root sub-volume.

  • When we enumerate the system's users we simply go through the list of home snapshots.

  • When a user authenticates and logs in we mount his home directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the app sub-volume to /opt/<vendorid>/, and the appropriate runtime sub-volume the app picked to /usr, as well as the user's /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where /usr is mounted from the framework sub-volume, and /home/$USER from his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to build software for multiple runtimes or architectures at the same time.

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user to keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.

An Example

Note that in result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install all of these into the same btrfs volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems installed, each in three versions, and one even in a beta version. We have four system instances around, two of them Fedora: maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.

Building Blocks

With this naming scheme plus the way we can combine sub-volumes on execution we already got quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature called "send-and-receive". It basically allows you to "diff" two file system versions and generate a binary delta. You can generate these deltas on a developer's machine and then push them onto the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes and frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume; later, when a new version is released, we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same way for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore: the underlying operation for installing apps, runtimes, frameworks and vendor OSes, as well as the operation for updating them, is exactly the same for all.
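
As a hedged illustration of what that delivery could look like, here is a small Python wrapper around the btrfs command-line tools (the sub-volume names come from the example above, the mount points are assumptions, the source sub-volumes must be read-only snapshots, and both ends need root):

    import subprocess

    def generate_delta(parent_subvol, new_subvol, out_path):
        """Vendor side: diff two read-only sub-volumes into a binary delta file."""
        with open(out_path, "wb") as out:
            subprocess.check_call(
                ["btrfs", "send", "-p", parent_subvol, new_subvol], stdout=out)

    def apply_delta(delta_path, volume):
        """User side: deserialize the delta, which shows up as a brand-new sub-volume."""
        with open(delta_path, "rb") as delta:
            subprocess.check_call(["btrfs", "receive", volume], stdin=delta)

    # Hypothetical vendor-side invocation (sub-volumes mounted under /srv/build):
    # generate_delta(
    #     "/srv/build/usr:org.fedoraproject.WorkStation:x86_64:24.8",
    #     "/srv/build/usr:org.fedoraproject.WorkStation:x86_64:24.9",
    #     "workstation-24.8-to-24.9.delta")
    #
    # Hypothetical user-side invocation (btrfs volume mounted at /vendor, which
    # must already contain the 24.8 parent sub-volume for the incremental delta):
    # apply_delta("workstation-24.8-to-24.9.delta", "/vendor")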

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies nicely to embedded use-cases. Regardless of whether you build a TV, an IVI system or a phone: you can put together your OS versions as usr trees, and then use the btrfs send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is follow the steps above, using your installed usr tree as the source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or in other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, nothing more. Live CDs and installed systems can be fully identical.

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar to how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as it is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key to what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox that does more than just hide files from the file system view. After all, with an app concept like the above, the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sandboxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs that other solutions try to solve in user space. One of them is the signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential, though, to get to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for it are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already benefit from what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app, each OS and each OS container sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would run on traditional systems.

There's no need to fully move to a system that uses only btrfs and strictly follows this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the user tries to install a runtime/app/framework/OS image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even we developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this is in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree in the first place. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

  • We want a unified scheme for how we can install and update OS images, user apps, runtimes and frameworks.

  • We want a unified scheme for how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as a standard feature of the system.

  • We want to allow app vendors to write their programs against very specific frameworks, knowing that they will end up being executed with the exact same set of libraries they chose.

  • We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference between booting them on bare metal, in a VM or as a container.

  • We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at linux.conf.au, but there was no interest in my session submissions there...

Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project, but implementing this scheme is not just a job for the systemd developers. This is a reinvention of how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This, after all, is explicitly not supposed to be a solution for one specific project and one specific vendor product; we care about making this open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!

August 29, 2014
We currently have a large influx of new people contributing to i915 - for the curious just check the git logs. As part of ramping them up I've done a few trainings about upstream review, and a bunch of people I've talked with at KS in Chicago were interested in that, too. So I've cleaned up the slides a bit and dropped the very few references to Intel internal resources. No speaker notes or video recording, but I think this is useful all in itself. And of course if you have comments or see big gaps - feedback is very much welcome:

Upstream Review Training Slides
August 21, 2014
Today I finally got X up on my vc4 driver using glamor.  As you can see, there are a bunch of visual issues, and what you can't see is that after a few frames of those gears the hardware locked up and didn't come back.  It's still major progress.
The code can be found in my vc4 branch of mesa and linux-2.6, and the glamor branch of my xf86-video-modesetting.  I think the driver's at the point now that someone else could potentially participate.  I've intentionally left a bunch of easy problems -- things like supporting the SCS, DST, DPH, and XPD opcodes, for which we have piglit tests (in glean) and which are just a matter of translating the math from TGSI's vec4 instruction set (documented in tgsi.rst) to the scalar QIR opcodes.
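
For the curious, here is roughly the math those opcodes expand to, written as a hedged Python sketch of the per-channel semantics per the TGSI documentation (the real task is emitting the equivalent scalar QIR instructions, not Python):

    import math

    def dph(a, b):
        """DPH: homogeneous dot product -- src0.w is implicitly 1.0."""
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + b[3]

    def xpd(a, b):
        """XPD: cross product, with w defined as 1.0."""
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0],
                1.0)

    def dst(a, b):
        """DST: distance-vector helper."""
        return (1.0, a[1] * b[1], a[2], b[3])

    def scs(a):
        """SCS: cosine in x, sine in y, 0 and 1 in z/w."""
        return (math.cos(a[0]), math.sin(a[0]), 0.0, 1.0)
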
August 19, 2014

I've switched my Git repositories to GitHub recently, and started to watch my contribution statistics, which were very low considering I spend my days hacking on open source software, especially OpenStack.

OpenStack hosts its Git repositories on its own infrastructure at git.openstack.org, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I'm using the same email address everywhere.

It turns out that it was not the case, and the help page about that on GitHub describes the rule in place to compute statistics. Indeed, according to GitHub, I had no relations to the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses Gerrit).

Starring a repository is enough to build a relationship between a user and a repository, so this was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all using a small Python script based on pygithub.
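
The script itself only needs a few lines; here is a minimal sketch of what it can look like with pygithub (the access token is a placeholder, and GitHub's API rate limits apply):

    from github import Github  # pip install PyGithub

    gh = Github("your-personal-access-token")  # placeholder token
    me = gh.get_user()

    # Star every repository of the openstack organization on GitHub.
    for repo in gh.get_organization("openstack").get_repos():
        me.add_to_starred(repo)
        print("starred", repo.full_name)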

And voilà, my statistics now include all my contributions to OpenStack!

August 18, 2014

A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.

In this article, I would like to relate what I've been doing with other Ceilometer developers over the last 5 months. I've lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it's high time to take a break and talk about it.

Ceilometer early design

For the last years, Ceilometer didn't change its core architecture. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called samples. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the resource id that is metered, the user and project id owning that resource, the meter name, the measured value, a timestamp and a few free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the collector.

This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.

The REST API exposed by Ceilometer allows executing various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on metrics. Allowing such a wide range of possibilities with such a flexible data structure lets you do a lot of different things with Ceilometer, as you can query the data in almost any way you want.

The scalability issue

We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests require the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field, and also on the free-form metadata (meaning non-indexed key/value tuples), has a terrible cost in terms of performance (as pointed out before, the metadata are attached to each sample generated by Ceilometer and stored as is). That basically means that the sample data structure is stored in most drivers in just one table or collection, in order to be able to scan it at once, and there's no good "perfect" sharding solution, making data storage scalability painful.

It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner, as most operations are O(n) where n is the number of samples recorded (see big O notation if you're unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes and with a data retention of several weeks. Fortunately, there are a few optimizations to make things smoother in general cases, but as soon as you run specific queries, the API gets barely usable.

During this last year, as the Ceilometer PTL, I discovered these issues first hand, since a lot of people were reporting exactly this kind of problem to me. We started several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.

Thinking outside the box

Unfortunately, the PTL job doesn't leave you enough time to work on the actual code, nor to play with anything new. I was coping with most of the project bureaucracy and wasn't able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and think a bit outside the box.

When one takes a look at what has been brought into Ceilometer recently, one can see that Ceilometer actually needs to handle 2 types of data: events and metrics.

Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into oslo.messaging.

Metrics are what Ceilometer needs to store but that are not necessarily tied to an event. Think about an instance's CPU usage, a router's network bandwidth usage, the number of images that Glance is storing for you, etc… These are not events, since nothing is happening. These are facts, states we need to meter.

Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.

I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there's still no silver bullet – this made it a good challenge.

The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API – as we do in all OpenStack services. This should cover the metric storage pretty well.

Cooking Gnocchi

A cloud of gnocchis!

At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after confusing the OpenStack Marconi project so many times, reading OpenStack Macaroni instead. At least one OpenStack project should have a "pasta" name, right?

The point of having a new project rather than sending patches to Ceilometer was that, first, I had no clue if it was going to turn into something any better, and second, I could iterate more rapidly without being strongly coupled to the release process.

The first prototype started around the following idea: what you want is to meter things. That means storing a list of (timestamp, value) tuples for them. I've named these things "entities", as no assumptions are made about what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn't care and should be agnostic in this regard.

One feature that we had discussed over several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation: aggregating samples over a period of time so as to only store a smaller amount of them. This is something that time-series formats such as RRDtool's have been doing on the fly for a long time, and I decided it was a good trail to follow.

I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to specify what kind of archiving is needed: 1-second precision over a day, 1-hour precision over a year, or even both.
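
As a purely illustrative sketch (not the actual Gnocchi API), such an archiving request can be thought of as a list of (granularity, number of points) pairs:

    # Hypothetical description of an archiving request for one entity:
    # keep 1-second points for a day, and 1-hour points for a year.
    archive_policy = [
        {"granularity": 1,    "points": 86400},  # 1s precision over one day
        {"granularity": 3600, "points": 8760},   # 1h precision over one year
    ]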

The first driver written to achieve that and store those metrics inside Gnocchi was based on whisper. Whisper is the file format used to store metrics for the Graphite project. For the actual storage, the driver uses Swift, which has the advantage of being part of OpenStack and scalable.

Storing the metrics for each entity in a different whisper file and putting them in Swift turned out to have a fantastic algorithmic complexity: it was O(1). Indeed, the complexity needed to store and retrieve metrics doesn't depend on the number of metrics you have nor on the number of things you are metering. Which is already a huge win compared to the current Ceilometer collector design.

However, it turned out that whisper has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or about everything being relative to now (time.time()). I started to hack on that in my own fork, but… then everything broke. The whisper project code base is, well, not the state of the art, and has zero unit tests. I was staring at a huge effort to transform whisper into the time-series format I wanted, without being sure I wasn't going to break everything (remember, no test coverage).

I decided to take a break and look into alternatives, and stumbled upon Pandas, a data manipulation and statistics library for Python. It turns out that Pandas supports time-series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time-series and named it carbonara (a wink to both the Carbon project and pasta, how clever!). The code is quite small (a third of whisper's, 200 SLOC vs 600 SLOC), does not have many of whisper's limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.
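
To give a feel for why Pandas is a good fit, this is roughly the kind of computation Carbonara has to do (a toy sketch, not Carbonara's actual code or on-disk format):

    import pandas as pd

    # Raw (timestamp, value) measures for one entity.
    measures = [
        ("2014-08-15 12:00:01", 42.0),
        ("2014-08-15 12:00:44", 12.0),
        ("2014-08-15 12:01:02", 23.0),
    ]
    series = pd.Series([v for _, v in measures],
                       index=pd.to_datetime([t for t, _ in measures]))

    # Downsample to 1-minute precision, keeping the mean of each period.
    print(series.resample("1min").mean())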

Anyway, the Gnocchi storage driver system is designed in the same spirit as the rest of the OpenStack and Ceilometer storage driver systems. It's a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write an InfluxDB driver, working closely with the upstream developer of that database. Dina Belova started to write an OpenTSDB driver. This helps make sure the API is designed the right way from the start.

Handling resources

Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of people in a room, it is useful to link these 2 separate entities to a resource, in this case the room, and to give a name to these relations, so one is able to identify what attribute of the resource is actually measured. It is also important to provide the possibility to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.

Relationship of entities and resources

Once this list of resources is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week or the list of instances hosted on a particular node right now.

Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.

All of these requirements led to the design of what's called the indexer. The indexer is responsible for indexing entities and resources, and for linking them together. The initial implementation is based on SQLAlchemy and should be pretty efficient. It's easy enough to index the most requested attributes (columns), and they are also correctly typed.
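
To make the indexer idea more concrete, here is a bare-bones sketch of what SQLAlchemy models for resources and entities could look like (purely illustrative; the actual Gnocchi schema is different and richer, handling things like resource history and specialized resource types):

    from sqlalchemy import Column, DateTime, ForeignKey, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Resource(Base):
        __tablename__ = "resource"
        id = Column(String(36), primary_key=True)      # resource UUID
        type = Column(String(32), index=True)          # e.g. "instance", "generic"
        user_id = Column(String(36), index=True)
        project_id = Column(String(36), index=True)
        started_at = Column(DateTime, index=True)
        ended_at = Column(DateTime, nullable=True)
        entities = relationship("Entity", back_populates="resource")

    class Entity(Base):
        __tablename__ = "entity"
        id = Column(String(36), primary_key=True)       # entity UUID
        name = Column(String(64))                       # e.g. "temperature", "cpu"
        resource_id = Column(String(36), ForeignKey("resource.id"))
        resource = relationship("Resource", back_populates="entities")

    # Every attribute used for filtering is a real, typed, indexed column --
    # which is what keeps listing and filtering resources cheap.
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)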

We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It'd be up to the users to store extra attributes.

Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.

The indexer classes and their relations

REST API

All of this is exported via a REST API that was partially designed and documented in the Gnocchi specification in the Ceilometer repository; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.

The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.

Macroscopic view of the Gnocchi architecture

Roadmap & Ceilometer integration

This whole plan was presented to and discussed with the Ceilometer team during the last OpenStack summit in Atlanta in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solving the Ceilometer collector scalability issue.

It was decided to conduct this project experiment in parallel with the current Ceilometer collector for the time being, and see where that would lead the project.

Early benchmarks

Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.

The following graph sums up the current Ceilometer performance issue pretty well. The more metrics you feed it, the slower it becomes.

For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that performance is stable, with no correlation to the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of O(1), and not O(n) anymore.

Next steps

Clément drawing the logo

While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should help Gnocchi development not be slowed down by the release process for now.

The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed alongside and plugged into Ceilometer. We'll discuss how to integrate it into the project in a more permanent and stronger manner, probably during the OpenStack Summit for Kilo that will take place next November in Paris.

We have an opening in our Graphics Team to work on improving the state of open source GPU drivers. Your tasks would include working on various types of hardware, making sure they work great under Linux, and improving the general state of the Linux graphics stack. Since the work includes working on some specific pieces of hardware, the candidate would need to relocate to our Westford office, just north of Boston.

We are open to candidates with a range of backgrounds, but of course previous history with the Linux kernel codebase, the X.org codebase or Wayland is an advantage.

Please contact me at cschalle-at-redhat-com if you are interested.

August 16, 2014

There hasn’t been a progress report on DEP-11 for some time, but that doesn’t mean there was no work going on.

DEP-11 is Debian’s implementation of AppStream, as well as an effort to enhance the metadata available about software in Debian. While initially AppStream was only about applications, DEP-11 was designed with a larger scope, to collect data about libraries, binaries and things like Python modules. Now, since AppStream 0.6, DEP-11 and AppStream have essentially the same scope, with the difference of DEP-11 metadata being described in YAML, while official AppStream data is XML. That was due to a request by our ftpmasters team, which doesn’t like XML (which is also not used anywhere in Debian, as opposed to YAML). But this doesn’t mean that people will have to deal with the YAML file format: the libappstream library will just take DEP-11 data as another data source for its Xapian database, allowing anything using libappstream to access that data just like the XML stuff. Richard’s libappstream-glib will also receive support for the DEP-11 format soon, filling its in-memory data cache and enabling the use of GNOME-Software on Debian.

So, what has been done so far? Over the past months, my Google Summer of Code student, Abhishek Bhattacharjee, was working hard to integrate DEP-11 support into dak, the Debian Archive Kit, which maintains the whole Debian archive. The result will be an additional metadata table in our internal Postgres database, storing detailed information about the software available in a Debian package, as well as “Components-<arch>.yml.gz” files in the Debian repositories. Dak will also produce an application icon cache and a screenshots repository. During the SoC, Abhishek focused mainly on the applications part of things, and less on the other components (like extracting data about Python modules or libraries) – these things can easily be implemented later.

The remaining steps will be to polish the code and make it merge-ready for Debian’s dak (as soon as it has received enough testing, we will likely give it a try on the Tanglu Debian derivative). Following that, Apt will be extended to fetch the DEP-11 data on demand on systems where it is useful (which is currently mostly desktop systems) – if you want to save a little bit of space, you will be able to disable downloading this extra metadata in Apt. From there, libappstream will take the data for its Xapian db. This will lead to the removal of the much-hated (on the ftpmasters’ and maintainers’ side) app-install-data package, which has not been updated for two years and only contains a small fraction of the metadata provided by DEP-11.

What Debian will ultimately gain from this effort is support for software centers like GNOME-Software, and improved support for tools like Apper and Muon in displaying applications. Long-term, with more metadata being available, it would be cool to add support for it to “specialized package managers”, like Python’s pip, npm or gem, to make them fetch information about available distribution software and install that instead of their own copies from 3rd-party repositories, if possible. This should ultimately lead to less code duplication across distributions and will likely result in fewer security issues, since the officially maintained and integrated distribution packages can easily be used, if possible. This is no attempt to make tools like pip obsolete, but an attempt to have the different tools installing software on your machine communicate better, instead of creating parallel worlds in terms of software management. Another nice side effect of more metadata will be options to search the software repos for tools handling specific mimetypes (in case you can’t open a file), smart software centers installing missing firmware, and automatic suggestions for developers about which software they need to install in order to build a specific software package. Also, the data allows us to match software across distributions; on that front, I will have some news soon (not sure how soon though, as I am currently in thesis-writing mode, and therefore don’t have that much spare time). Since the goal is to have these features available on all distributions supporting AppStream, it will take longer to realize – but we are on a good way.

So, if you want some more information about my student’s awesome work, you can read his blogpost about it. He will also be at Debconf’14 (Portland). (I can’t make it this time, but I surely won’t miss the next Debconf)

Sadly, I only see a very small chance to have the basic DEP-11 stuff land in time for Jessie (lots of review work needs to be done, and some more code needs to be written), but we will definitely have it in Jessie+1.

A small example of how this data will look can be found here – a larger, actual file is available here. Any questions and feedback are highly appreciated.

August 15, 2014

Everyone has been blogging about GUADEC, but I’d like to talk about my other favorite conference of the year, which is GNOME.Asia. This year, it was in Beijing, a mightily interesting place. A giant megalopolis, with grandiose architecture, but at the same time surprisingly easy to navigate with its efficient metro system and affordable taxis. But the air quality is as bad as they say, at least during the incredibly hot summer days when we visited.

The conference itself was great. This year it was co-hosted with FUDCon’s Asian edition, and it was interesting to see a crowd that’s really different from the one that attends GUADEC: many more people involved in evangelising, deploying and using GNOME, as opposed to just developing it, which gives me a different perspective.

On a related note, I was happy to see a healthy delegation from Asia at GUADEC this year!

Sponsored by the GNOME Foundation

August 10, 2014

Hello,

I’ll talk again about the interface between the Linux kernel and the userspace (mesa). After a few weeks of work, I now have a full implementation which exposes NVIDIA’s performance counters in Nouveau. I actually have two versions with different approaches. The first one is almost “all-userspace”, which means that the configuration and the logic of performance counters are stored in the userspace, while the second one is almost “all-kernelspace” and only exposes to the userspace which events can be monitored. These two approaches use a set of software methods and the perfmon engine of Nouveau, initially written by Ben Skeggs, in order to set up performance counters.

This post will only focus on global counters; please refer to my latest article about MP counters on nv50/Tesla if you are interested in those. Before we continue, let me recall what a performance counter is for NVIDIA.

PCOUNTER: The performance counters engine

A hardware performance counter is a set of special registers which are used to store counts of hardware-related activities. Hardware counters are often used by developers to identify bottlenecks in their applications.

PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters. Counters do not sample one 8-bit signal; they sample a macro signal. A macro signal is the aggregation of 4 signals which have been combined using a function. An overview of this logic is represented by the figure below.

pcounter
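
To picture how this works, here is a tiny, purely illustrative software model of a macro signal (the hardware does this with multiplexers and a truth table, not Python, and the register layout is simplified):

    def macro_signal(truth_table, s0, s1, s2, s3):
        """Combine four 1-bit signals into one, using a 16-entry truth table.

        Bit n of the truth table gives the output for input combination
        n = s3<<3 | s2<<2 | s1<<1 | s0.
        """
        index = (s3 << 3) | (s2 << 2) | (s1 << 1) | s0
        return (truth_table >> index) & 1

    # A truth table of 0x8000 only sets the entry for index 15, i.e. the macro
    # signal is the logical AND of the four inputs; the counter then simply
    # counts the cycles where that combination is true.
    count = 0
    for sample in [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 1, 1)]:
        count += macro_signal(0x8000, *sample)
    print(count)  # -> 2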

Now, let me talk a bit about the graphics counters exposed by NVIDIA on the nv50/Tesla family.

Graphics counters for 3D applications

Graphics counters can be used to give detailed information about OpenGL/Direct3D applications. These performance counters are only exposed by NVIDIA PerfKit, an advanced software suite for profiling OpenCL and Direct3D/OpenGL applications on Windows (only). Last year, I reverse engineered most of these graphics counters. You can take a quick look at the documentation for nva3 (for example); this will introduce the notion of complex hardware events.

Overview of complex hardware events

A complex hardware event is composed of one or two macro signals which have been combined with a counter mode. Some of them are sometimes multiplexed, and thus a multiplexer (a tuple of address and value) needs to be configured in the engine which generates the signal. Hardware events are thus the aggregation of multiple 8-bit signals, and they are harder to monitor than a simple signal. Some events are also too complex to be monitored in one go and thus need multiple passes. As PerfKit polls counters after each frame, an event that requires multiple passes will need the same number of frames to be monitored. For instance, for frame x the counters are set up for pass #0, while they are set up for pass #1 at frame x+1. The results of the two passes are then combined to create the result of the event. Multi-pass events are thus less accurate, because they need more frames to be monitored.

The main goal of the interface between the kernel and mesa is to expose these complex hardware events to the userspace.

The first interface (“all-userspace” approach)

The main idea of this interface is to store the configuration of complex hardware events inside mesa. In this approach, the kernel only knows the list of 8-bit signals and exposes them with a unique string identifier; for example, the signal 0xcb on nva3 is associated with ‘gr_idle’ on set 1. Then, the userspace can build complex events and send the configuration to the kernel through an ioctl call which allocates a NOUVEAU_PERFCTR_CLASS object. A NOUVEAU_PERFCTR_CLASS object is used to init, poll and read performance counters.

This interface is based on a set of software methods used to control performance counters. Basically, we first allocate a NOUVEAU_PERFCTR_CLASS object with the configuration (8-bit signal/function/mode …) of the counter. Then, before a frame is rendered (using the begin_query() hook of gallium) we send the handle of this object with a software method to start monitoring. At this point, the configuration is written to PCOUNTER and the counter starts to count hardware-related activities. After the frame, we send a sequence number with another software method to read back values using a notify buffer object which is allocated along the current channel. If you are interested, a previous post gives more details about that interface.

With this “all-userspace” approach, the kernel is not able to monitor complex hardware events because the configuration and the logic are stored in the userspace. Actually, the configuration is shared between the kernel and mesa: the kernel only knows 8-bit signals while the userspace knows the configuration of hardware events.

Perf, also called perf_events, is a kernel-based interface for profiling Linux which is able to monitor performance counters like the number of instructions executed. Thus, if the configuration of hardware events is stored in the userspace, this will be a problem for exposing them in perf, because we don’t want to duplicate the configuration. I also talked with Daniel Vetter, the maintainer of the i965 driver and responsible for a major part of DRM, and he seems to agree with the idea that it could be good to expose hardware events in perf.

We also have another problem related to muxes, because the userspace knows the configuration while the kernel does not. So, the kernel has to check the addresses of muxes in order to avoid security issues.

The last problem is that the interface is closely based on the perfmon engine, so if perfmon changes in the future, this will require adding a new interface. But we don’t want to add another driver-private ioctl or design a new interface in case perfmon has to evolve in the future. However, with the “all-kernelspace” approach we don’t have this problem, since the kernel knows the logic and only exposes a list of monitorable events.

However, the “all-userspace” approach has the advantages of reducing the amount of code in the kernel and of facilitating the configuration of counters, since all the logic is located in the userspace.

If you are interested, you can take a look at the code:

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_pcounter_pm

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfctr_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/expose_perfctr_class

The second interface (“all-kernelspace” approach)

This interface is kernel-based, like Perf. The configuration and the logic (except for multi-pass events, which need two frames) are stored in the kernel only. The kernel exposes a list of monitorable events. Thus, the userspace just has to allocate a NOUVEAU_PERFEVENT_CLASS object used to init, read and poll complex hardware events.

Like the previous interface, this one is also based on a set of software methods used to control performance counters. The behaviour is almost the same as before, except that we allocate a NOUVEAU_PERFEVENT_CLASS object which represents a complex hardware event instead of a NOUVEAU_PERFCTR_CLASS.

With this approach it’s easy to monitor complex hardware events inside Nouveau and to expose them to Perf in the future. Also, there are no security issues, because muxes are configured from and by the kernel; we don’t have to check their addresses.

Since the kernel only exposes a list of events and stores the configuration, perfmon can change in the future without any impact on the interface between the kernel and the userspace. Basically, the userspace only knows the name of events, and some flags used to do scheduling. However, it’s hard to expose to the userspace which events are monitorable simultaneously and which are not.

On nv50/Tesla, we have 8 domains (or sets) and 4 counters per domain. Thus, if all complex events only use one counter per domain, we can monitor 32 events simultaneously. Good! But actually not… because some events use 2 counters per domain. To handle this case, the userspace can retrieve the number of available domains and the number of counters per domain through an ioctl call. Then, we expose the domain ID and the number of counters needed by an event. With this information, we can schedule events from the userspace. But we still have one problem: how to handle the case where two events on the same domain share a mux?

Some events are multiplexed, and two or more events can use the same mux with a different value. To handle this special case, we expose conflicts to the userspace using 64-bit flags. Thus, the userspace just has to do an AND comparison to check if two events can be monitored simultaneously.
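
Concretely, the scheduling check on the userspace side could look something like the following sketch (illustrative only; apart from ‘gr_idle’ the event names, flag values and exact layout are made up, the real definitions come from the kernel interface):

    # Hypothetical per-event data as the kernel could expose it: domain, number
    # of counters needed, and a 64-bit conflict mask (one bit per shared mux).
    EVENTS = {
        "gr_idle":      {"domain": 1, "counters": 1, "conflicts": 0x0},
        "made_up_ev_a": {"domain": 1, "counters": 1, "conflicts": 0x1},
        "made_up_ev_b": {"domain": 1, "counters": 2, "conflicts": 0x1},
    }

    def can_monitor_together(name_a, name_b):
        a, b = EVENTS[name_a], EVENTS[name_b]
        if a["domain"] != b["domain"]:
            return True                    # different domains never collide
        if a["counters"] + b["counters"] > 4:
            return False                   # only 4 counters per domain on nv50/Tesla
        # Events sharing a mux with different values have overlapping conflict bits.
        return (a["conflicts"] & b["conflicts"]) == 0

    print(can_monitor_together("gr_idle", "made_up_ev_b"))       # True
    print(can_monitor_together("made_up_ev_a", "made_up_ev_b"))  # False: mux conflict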

The source code of this “all-kernelspace” version is available below:

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_kernelspace_version

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfevent_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/nv50_kernelspace_version

What is the best approach? Pros & cons

“all-userspace” approach

 Pros:
  • reduce the amount of code in the kernel
  • easy to apply logic of performance counters
 Cons:
  • not possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration of counters is shared between the userspace (complex events) and the kernelspace (8-bits signals)
  • possible security issues (the kernel must know the addresses of muxes to check queries)
  • the interface (and the userspace) must be changed if perfmon changes in the future

“all-kernelspace” approach

 Pros:
  • possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration and logic (except multi-pass events) are stored in the kernel only
  • no security issues (muxes are configured by the kernel)
  • perfmon can evolve without any impact on the interface, since it only exposes a list of events
 Cons:
  • add more code in the kernel
  • hard to expose to the userspace what events are monitorable simultaneously or not

These two interfaces have different pros and cons, but in my opinion the “all-kernelspace” approach is more elegant and more future-proof, since we can monitor complex hardware events inside Nouveau and expose them to perf (Linux).

To sum up, we still have to choose one version of the interface between the kernel and mesa. I’ll talk about this with Ben Skeggs, the maintainer of Nouveau, to get his opinion. We hope to get the code upstream in September or October, and before Linux 3.19.

Have a good day!


August 08, 2014

At Collabora, we’re always on the lookout for cool opportunities involving Wayland and we noticed recently that Mozilla had started to show some interest in porting Firefox to Wayland. In short, the Wayland display server is becoming very popular for being lightweight, versatile yet powerful, and is designed to be a replacement for X11. Chrome and WebKit already got Wayland ports and we think that Firefox should have one too.

Some months ago, we wrote a simple proof of concept, basically starting from Gecko’s actual GTK3 paths and stripping all the MOZ_X11 ifdefs out of the way. We did a bunch of quick hacks fixing broken stuff, and rather easily and quickly (a couple of days) we got Firefox to run on Weston (Wayland’s official reference compositor). OK, because of hard X11 dependencies, keyboard input was broken and decorations suffered a little, but that’s a very good start! Take a look at the screenshot below :)

firefox-on-wayland


August 07, 2014

Over the past few months, working at Collabora, I have helped Mozilla get rid of Xlib surfaces for content on the Linux platform. This task was the primary problem keeping Mozilla from turning OpenGL layers on by default on Linux, which is one of their long-term goals. I’ll briefly explain this long-term goal and will thereafter give details about how I got rid of Xlib surfaces.

LONG-TERM GOAL – Enabling Skia layers by default on Linux

My work fit into a wider, long-term goal that Mozilla currently has: to enable Skia layers by default on Linux (Bug 1038800). And for a glimpse into how Mozilla initially made Skia layers work on Linux, see bug 740200. At the time of writing this article, Skia layers are still not enabled by default because there are some open bugs about failing Skia reftests and OMTC (off-main-thread compositing) not being fully stable on Linux at the moment (Bug 722012). Why is OMTC needed to get Skia layers on by default on Linux? Simply because, by design, users that choose OpenGL layers are being grandfathered into OMTC on Linux… and since the MTC (main-thread compositing) path has been dropped lately, we must tackle the OMTC bugs before we can dream about turning Skia layers on by default on Linux.

For a more detailed explanation of the issues and design considerations pertaining to turning Skia layers on by default on Linux, see this wiki page.

MY TASK – Getting rid of Xlib surfaces for content

Xlib surfaces for content rendering have been used extensively for a long time now, but when OpenGL got attention as a means to accelerate layers, we quickly ran into interoperability issues between XRender and Texture_From_Pixmap OpenGL extension… issues that were assumed insurmountable after initial analysis. Also, and I quote Roc here, “We [had] lots of problems with X fallbacks, crappy X servers, pixmap usage, weird performance problems in certain setups, etc. In particular we [seemed] to be more sensitive to Xrender implementation quality that say Opera or Webkit/GTK+.” (Bug 496204)

So for all those reasons, someone had to get rid of Xlib surfaces, and that someone was… me ;)

The Problem

So the problem was to get rid of Xlib surfaces (gfxXlibSurface) for content under the Linux/GTK platform and, implicitly of course, to replace them with image surfaces (gfxImageSurface) so they become regular memory buffers in which we can render with GL/GLES and from which we can composite using the GPU. Now, it’s pretty easy to force creation of image surfaces (instead of Xlib ones) for all content layers in the gecko gfx/layers framework: just force gfxPlatformGTK::CreateOffscreenSurfaces(…) to create gfxImageSurfaces in any case.

The problem is, naively doing so gives rise to a series of performance regressions and sub-optimal paths being taken, for example copying image buffers around when passing them across process boundaries, or unnecessary copying when compositing under X11 with XRender support. So the real work was to fix everything after having pulled the gfxXlibSurface plug ;)

The Solution

The first glitch on the way was that GTK2 theme rendering, by design, *had* to happen on Xlib surfaces. We didn’t have much choice but to narrow down our efforts to the GTK3 branch alone. What’s nice with GTK3 on that front is that it makes integral use of cairo, thus letting theme rendering happen on any type of cairo_surface_t. For more detail on that decision, read this.

Upfront, we noticed that the already implemented GL compositor was properly managing and buffering image layer contents, which is a good thing, but along the way we saw that the ‘basic’ compositor did not. So we started streamlining the basic compositor under OMTC for GTK3.

The core of the solution here was implementing server-side buffering of layer contents that were using image backends. Since the targeted platform was Linux/GTK3 and since XRender support is rather frequent, the most intuitive thing to do was to subclass BasicCompositor into a new X11BasicCompositor and make it use a new specialized DataTextureSource (that we called X11DataTextureSourceBasic) that basically buffers incoming layer content in ::Update() to a gfxXlibSurface that we keep alive for the TextureSource lifetime (unless the surface changes size and/or format).

Performance results were satisfying. For 64 bit systems, we had around 75% boost in tp5o_shutdown_paint, 6% perf gain for ‘cart’, 14% for ‘tresize’, 33% for tscrollx and 12% perf gain on tcanvasmark.

For complete details about this effort, design decisions and resulting performance numbers, please read the corresponding bugzilla ticket.

To see the code that we checked in to solve this, look at these 2 patches:

https://hg.mozilla.org/mozilla-central/rev/a500c62330d4

https://hg.mozilla.org/mozilla-central/rev/6e532c9826e7

Cheers !

 



  • If you have an orientation sensor in your laptop that works under Windows 8, this tool might be of interest to you.
  • Mattias will use that code as a base to add Compass support to Geoclue (you're on the hook!)
  • I've made a hack to load games metadata using Grilo and Lua plugins (everything looks like a nail when you have a hammer ;)
  • I've replaced a Linux phone full of binary blobs by another Linux phone full of binary blobs
  • I believe David Herrmann missed out on asking for a VT, and getting something nice in return.
  • Cosimo will be writing some more animations for me! (and possibly for himself)
  • I now know more about core dumps and stack traces than I would want to, but far less than I probably will in the future.
  • Get Andrea to approve Timm Bädert's git account so he can move Corebird to GNOME. Don't forget to try out Charles, Timm!
  • My team won FreeFA, and it's not even why I'm smiling ;)
  • The cathedral has two towers!
Unfortunately for GUADEC guests, Bretzel Airlines opened its new (and first) shop on Friday, during the last days of the BoFs.

(Lovely city, great job from Alexandre, Nathalie, Marc and all the volunteers, I'm sure I'll find excuses to come back :)
So with the 3.16 kernel out of the door it's time to look at what's queued up for the Intel graphics driver in 3.17.

This release features the universal plane support from Matt Roper, already enabled by default. This is prep work for atomic modesetting and pageflipping support: for a while we have supported additional (overlay) planes in the DRM core and the i915 driver, but there have always been two implicit planes directly attached to the CRTC: the primary plane used by the SetCrtc and PageFlip functions, and the optional cursor. But with the atomic ioctl it's easier to handle everything as an explicit plane, so Matt's patches split these implicit planes away into separate real plane objects. This is a nice cleanup of the KMS API in general, since a lot of SoC hardware has unified plane hardware, where cursor, primary plane and any overlays are fully interchangeable. We already expose this to userspace if it sets the corresponding feature flag.

Another big feature on the display side is the improved PSR support, which is now enabled by default on Haswell and Broadwell. The tricky bit with PSR (and also with FBC), and the reason we didn't yet enable this by default, is correctly supporting legacy frontbuffer rendering (for example for X). The hardware provides a bit of support to do that, but it doesn't catch all possible frontbuffer rendering and has a lot of other limitations. To finally fix this for real we've added accurate frontbuffer tracking in software. This should finally allow us to enable a lot of display power saving features by default, like PSR on Baytrail, FBC (on all platforms) and DRRS (dynamic refresh rate switching).

On actual platform display enabling we have lots of improvements all over: Baytrail MIPI DSI support has greatly stabilized, backlight and power sequencer fixes, mmio based flips to work around issues with stalls and hangs for blitter ring based flips and plenty of other work. The core drm pieces for plane rotation support have also landed, unfortunately the i915 parts didn't make the cut for 3.17.

Another big area, as usual, has been general power management improvements. We now support runtime PM for DPMS Off and not just when the output is completely disabled. This was fairly invasive work since our current modesetting code assumed that a DPMS Off/On cycle will not destroy register state, but that's exactly what runtime PM can do. On the plus side this reorganization greatly cleaned up the code base and prepared the driver for atomic modesetting, which requires a similar separation between state computation and actual hw state updating like this feature.

Jesse Barnes implemented S0ix support for system suspend/resume. Marketing has some crazy descriptions for this, but essentially this means that we use the same power saving knobs for system suspend as for runtime PM - the entire machine is still running, just at a very low power state. Long-term this should simplify our system suspend code a bit since we can just reuse all the code used to implement runtime PM.

Moving on to the render side of the gpu there have been again improvements to the rps code. Chris Wilson further tuned the rps boost logic, and Ville and Deepak implemented rps support for Cherrytrail.
Jesse contributed ppgtt support for Baytrail which will be a lot more interesting once we enable full ppgtt again (hopefully in 3.18).

For Broadwell, semaphore support from Ben and Rodrigo was merged, but it looks like we need to disable that again due to stability issues. Oscar Mateo also implemented a large pile of interrupt handling improvements which hopefully address the small races and bugs we've had in the past on some platforms. There's also a lot of refactoring patches from Oscar to prepare for execlist support. Execlists are the new way of submitting work to the gpu, first supported on Broadwell (but not yet mandatory). The key feature compared to legacy ringbuffer submission is that we'll finally be able to preempt gpu tasks.

And as usual there have been tons of bugfixes and improvements all over. Oh and: user mode setting has moved one step further on the path to deprecation and is now fully disabled. If no one complains about this we can finally rip out all that code in one of the next kernel releases.
August 04, 2014

In April, when Solaris 11.2 Beta was released, I posted a list of changes to bundled software packages between Solaris 11.1 & 11.2. Now that the final release of Solaris 11.2 reached General Availability last week, I've gone back to compare the beta release with the GA release.

As you would expect, there are many fewer changes in the three months between beta & GA than in the 18 months before that. Most of the change came from upgrading the OpenStack packages from the Grizzly (2013.1) release to the Havana (2013.2) release, and adding the Swift OpenStack Object Storage components and other packages like Django which the new OpenStack components needed. There are also some general bug fix or security fix updates, such as upgrading OpenSSL from 1.0.1g to 1.0.1h.

One other change that showed up when gathering data for this list was that the Oracle Database 12c prerequisites package was renamed between beta & GA to better match the database naming style - previously it was called group/prerequisite/oracle/oracle-rdbms-server-12cR1-preinstall but is now group/prerequisite/oracle/oracle-rdbms-server-12-1-preinstall. Fortunately, you don't have to type in the whole FMRI to install it, pkg install oracle-rdbms-server-12-1-preinstall is enough.

Detailed list of changes

This table shows most of the changes to the bundled packages between the 11.2 beta released in April, and the 11.2 GA release in July.

As before, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.

Package    Upstream    11.2 Beta    11.2 GA
cloud/openstack/cinder OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/glance OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/horizon OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/keystone OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/neutron OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/nova OpenStack 0.2013.1.4 0.2013.2.3
cloud/openstack/swift OpenStack not included 1.10.0
developer/java/jdk-7 Java 1.7.0.55.13 (Java SE 7u55) 1.7.0.65.17 (Java SE 7u65)
developer/java/jdk-8 Java 1.8.0.5.13 (Java SE 8u5) 1.8.0.11.12 (Java SE 8u11)
diagnostic/wireshark Wireshark 1.10.6 1.10.7
library/cacao 2.4.2.0 2.4.3.0
library/java/javadb Java 10.6.2.1 10.6.2.3
library/nspr Mozilla NSPR 4.8.9 4.9.5
library/python/ceilometerclient OpenStack not included 1.0.10
library/python/cffi Python CFFI not included 0.8.2
library/python/cinderclient OpenStack 1.0.7 1.0.9
library/python/django Django not included 1.4.11
library/python/dnspython dnspython not included 1.11.1
library/python/dogpile.cache dogpile.cache not included 0.5.3
library/python/dogpile.core dogpile.core not included 0.4.1
library/python/heatclient OpenStack not included 0.2.9
library/python/iso8601 pyiso8601 not included 0.1.10
library/python/jinja2 Jinja not included 2.7.2
library/python/keystoneclient OpenStack 0.4.1 0.8.0
library/python/neutronclient OpenStack 2.3.1 2.3.4
library/python/novaclient OpenStack 2.15.0 2.17.0
library/python/oslo.config OpenStack not included 1.3.0
library/python/pbr OpenStack not included 0.8.1
library/python/pycparser pycparser not included 2.10
library/python/python-memcached python-memcached not included 1.53
library/python/six pypi six not included 1.6.1
library/python/swiftclient OpenStack 2.0.2 2.1.0
library/python/troveclient OpenStack not included 0.1.4
library/python/websockify websockify not included 0.5.1
library/python/xattr xattr not included 0.7.4
library/security/nss Mozilla NSS 4.13.1 4.14.3
library/security/openssl OpenSSL 1.0.1.7 (1.0.1g) 1.0.1.8 (1.0.1h)
mail/thunderbird Mozilla Thunderbird 17.0.6 17.0.11
network/dns/bind ISC BIND 9.6.3.10.2 (9.6-ESV-R10-P2) 9.6.3.11.0 (9.6-ESV-R11)
network/rsync rsync 3.0.9 3.1.0
runtime/java/jre-7 Java 1.7.0.55.13 (Java SE 7u55) 1.7.0.65.17 (Java SE 7u65)
runtime/java/jre-8 Java 1.8.0.5.13 (Java SE 8u5) 1.8.0.11.12 (Java SE 8u11)
security/nss-utilities Mozilla NSS 4.13.1 4.14.3
service/network/dns/bind ISC BIND 9.6.3.10.2 (9.6-ESV-R10-P2) 9.6.3.11.0 (9.6-ESV-R11)
shell/bash GNU Bash 4.1.9 4.1.11
system/test/sunvts Oracle VTS 7.18.0 7.18.1
web/browser/firefox Mozilla Firefox 17.0.6 17.0.11
web/java-servlet/tomcat Apache Tomcat 6.0.39 6.0.41
web/server/ejabberd ejabberd 2.1.8 2.1.13
A bit more than a year ago, I ordered a Geeksphone Peak, one of the first widely available Firefox OS phones to explore this new OS.

Those notes are probably not very useful on their own, but they might give a few hints to stuck Android developers.

The hardware

The device has a Qualcomm Snapdragon S4 MSM8225Q SoC, which uses the Adreno 203 and a 540x960 Protocol A (4 touchpoints) touchscreen.

The Adreno 203 (Note: might have been 205) is not supported by Freedreno, and is unlikely to be. It's already a couple of generations behind the latest models, and getting a display working on this device would also require (re-)writing a working panel driver.

At least the CPU is an ARMv7 with a hardware floating-point unit (unlike the incompatible ARMv6 used by the Raspberry Pi), which means that much more software is available for it.

Getting a shell

Start by installing the android-tools package, and copy the udev rules file to the correct location (the location is mentioned in the rules file itself).

Then, on the phone, turn on the developer mode. Plug it in and run "adb devices"; you should see something like:

$ adb devices
List of devices attached
22ae7088f488 device

Now run "adb shell" and have a browse around. You'll realise that the kernel, drivers, init system, baseband stack, and much more, is plain Android. That's a good thing, as I could then order Embedded Android, and dive in further.

If you're feeling a bit restricted by the few command-line applications available, download an all-in-one precompiled busybox, and push it to the device with "adb push".

You can also use aafm, a simple GUI file manager, to browse around.

Getting a Fedora chroot

After formatting a MicroSD card in ext4 and unpacking a Fedora system image in it, I popped it inside the phone. You won't be able to use this very fragile script to launch your chroot just yet though, as we lack a number of kernel features that are required to run Fedora. You'll also note that this is an old version of Fedora. There are probably newer versions available around, but I couldn't pinpoint them while writing this article.

Running Fedora, even in a chroot, on such a system will allow us to compile natively (I wouldn't try to build WebKit on it though) and run against a glibc setup rather than Android's bionic libc.

Let's recompile the kernel to be able to use our new chroot.

Avoiding the brick

Before recompiling the kernel and bricking our device, we'll probably want to make sure that we have the ability to restore the original software. Nothing worse than a bricked device, right?

First, we'll unlock the bootloader, so we can modify the kernel, and eventually the bootloader. I took the instructions from this page, but ignored the bits about flashing the device, as we'll be doing that a different way.

You can grab the restore image from my Fedora people page since, as seems to be the norm, Android(-ish) device makers deny any involvement in devices that are more than a couple of months old: no restore software, no product page.

The recovery should be as easy as

$ adb reboot-bootloader
$ fastboot flash boot boot.img
$ fastboot flash system system.img
$ fastboot flash userdata userdata.img
$ fastboot reboot

This technique on the Geeksphone forum might also still work.

Recompiling the kernel

The kernel shipped on this device is a modified Ice-Cream Sandwich "Strawberry" version, as spotted using the GPU driver code.

We grabbed the source code from Geeksphone's github tree, installed the ARM cross-compiler (in the "gcc-arm-linux-gnu" package on Fedora) and got compiling:

$ export ARCH=arm
$ export CROSS_COMPILE=/usr/bin/arm-linux-gnu-
$ make C8680_defconfig
# Make sure that CONFIG_DEVTMPFS and CONFIG_EXT4_FS_SECURITY get enabled in the .config
$ make

We now have a bzImage of the kernel. Launching "fastboot boot zimage /path/to/bzImage" didn't seem to work (it would have used the kernel only for the next boot), so we'll need to replace the kernel on the device.

It's a bit painful to have to do this, but we have the original boot image to restore in case our version doesn't work. The boot partition is on partition 8 of the MMC device. You'll need to install my package of the "android-BootTools" utilities to manipulate the boot image.


$ adb shell 'cat /dev/block/mmcblk0p8 > /mnt/sdcard/disk.img'
$ adb pull /mnt/sdcard/disk.img
$ bootunpack boot.img
$ mkbootimg --kernel /path/to/kernel-source/out/arch/arm/boot/zImage --ramdisk p8.img-ramdisk.cpio.gz --base 0x200000 --cmdline 'androidboot.hardware=qcom loglevel=1' --pagesize 4096 -o boot.img
$ adb reboot-bootloader
$ fastboot flash boot boot.img

If you don't want the graphical interface to run, you can modify the Android init to avoid that.

Getting a Fedora chroot, part 2

Run the script. It works. Hopefully.

If you manage to get this far, you'll have a running Android kernel and user-space, and will be able to use the Fedora chroot to compile software natively and poke at the hardware.

I would expect that, given a kernel source tree made available by the vendor, you could follow those instructions to transform your old Android phone into an ARM test "machine".

Going further, native Fedora boot

Not for the faint of heart!

The process is similar, but we'll need to replace the initrd in the boot image as well. In your chroot, install Rob Clark's hacked-up adb daemon with glibc support (packaged here) so that adb commands keep on working once we natively boot Fedora.

Modify the /etc/fstab so that the root partition is the SD card:

/dev/mmcblk1 /                       ext4    defaults        1 1

We'll need to create an initrd that's small enough to fit on the boot partition though:

$ dracut -o "dm dmraid dmsquash-live lvm mdraid multipath crypt dasd zfcp i18n" initramfs.img

Then run "mkbootimg" as above, but with the new ramdisk instead of the one unpacked from the original boot image.

Flash, and reboot.

Nice-to-haves

In the future, one would hope that packages such as adbd and the android-BootTools could get into Fedora, but I'm not too hopeful as Fedora, as a project, seems uninterested in running on top of Android hardware.

Conclusion

Why am I posting this now? Firstly, because it allows me to organise the notes I took nearly a year ago. Secondly, I don't have access to the hardware anymore, as it found a new home with Aleksander Morgado at GUADEC.

Aleksander hopes to use this device (Qualcomm-based, remember?) to add native telephony support to the QMI stack. This would in turn get us a ModemManager Telephony API, and the possibility of adding support for more hardware, such as through RIL and libhybris (similar to the oFono RIL plugin used in the Jolla phone).
July 30, 2014
or why publishing code is STEP ZERO.

If you've been developing code internally for a kernel contribution, you've probably got a lot of reasons not to default to working in the open from the start: you probably don't work for Red Hat or other companies with default-to-open policies, or perhaps you are scared of the scary kernel community and want to present a polished gem.

If your company is a pain with legal reviews etc, you have probably spent (read: wasted) months of engineering time on internal reviews and such, so you think all of this matters, because why wouldn't it: you just spent (wasted) a lot of time on it, so it must matter.

So you have your polished codebase; why wouldn't those kernel maintainers love to merge it?

Then you publish the source code.

Oh, look you just left your house. The merging of your code is many many miles distant and you just started walking that road, just now, not when you started writing it, not when you started legal review, not when you rewrote it internally the 4th time. You just did it this moment.

You might have to rewrite it externally 6 times, you might never get it merged, it might be something your competitors are also working on, and the kernel maintainers would rather you cooperated with people your management would lose their minds over, that is the kernel development process.

step zero: publish the code. leave the house.

(lately I've been seeing this problem more and more, so I decided to write it up, and it really isn't directed at anyone in particular, I think a lot of vendors are guilty of this).
July 25, 2014
Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This, what will hopefully become a series rather than just one post, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry through the wl_display. Registry is the root of the whole interface (class) hierarchy. Wl_registry advertises the global objects by numerical name, and using wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.

Although rare, also the server may create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.

Object creation never fails. Once the request or event is sent, the new object it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.
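As a small client-side illustration of that hierarchy (my own sketch, not from the article): the numerical name and version below would come from a wl_registry.global event, which is omitted here.

#include <stdint.h>
#include <wayland-client.h>

/* Sketch: go from wl_registry to wl_surface. */
static struct wl_surface *
create_surface(struct wl_registry *registry, uint32_t name, uint32_t version)
{
    /* wl_registry.bind: the one request whose new_id also carries the
     * interface name and version explicitly on the wire. */
    struct wl_compositor *compositor =
        wl_registry_bind(registry, name, &wl_compositor_interface, version);

    /* A normal new_id request: the interface (wl_surface) is fixed by the
     * XML specification, so only the new object ID goes over the wire. */
    return wl_compositor_create_surface(compositor);
}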

How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.

The other way is to destroy the object by an event. In that case, the interface must not define a destructor request, and the event must be clearly documented to be destructive, as there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and it requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.
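For illustration, this is roughly how wl_callback is consumed with libwayland-client; the done event plays the role of the destructor event, so the only thing the client does on receiving it is destroy its proxy (a minimal sketch of my own, error handling omitted):

#include <stdint.h>
#include <wayland-client.h>

/* The done event is the destructor event here: once it arrives, the only
 * thing left for the client to do is destroy its proxy. */
static void sync_done(void *data, struct wl_callback *callback, uint32_t serial)
{
    int *done = data;
    *done = 1;
    wl_callback_destroy(callback);
}

static const struct wl_callback_listener sync_listener = {
    .done = sync_done,
};

/* Classic roundtrip: create a wl_callback via wl_display.sync and dispatch
 * events until the compositor answers. */
static void roundtrip(struct wl_display *display)
{
    int done = 0;
    struct wl_callback *callback = wl_display_sync(display);

    wl_callback_add_listener(callback, &sync_listener, &done);
    while (!done && wl_display_dispatch(display) != -1)
        ;
}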

Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence. The client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into synchronous, congratulations!

Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:
  • Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such that all resources on both the server and client side could be freed. And there are still some cases not fixed.
  • Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
  • Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
  • Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.
July 23, 2014
DRI3 has plenty of necessary fixes for X.org and Wayland, but it's still young in its integration. It's been integrated in the upcoming Fedora 21, and recently in Arch as well.

If WebKitGTK+ applications hang or become unusably slow when an HTML5 video is supposed to be playing, you might be hitting this bug.

If Totem crashes on startup, it's likely this problem, reported against cogl for now.

Feel free to add a comment if you see other bugs related to DRI3, or have more information about those.

Update: Wayland is already perfect, and doesn't use DRI3. The "DRI2" structures in Mesa are just that, structures. With Wayland, the DRI2 protocol isn't actually used.
I've just pushed the vc4-sim-validate branch to my Mesa tree. It's the culmination of the last week's worth of pondering and false starts since I got my first texture sampling in simulation last Wednesday.

Handling texturing on vc4 safely is a pain. The pointer to texture contents doesn't appear in the normal command stream, and instead it's in the uniform stream. Which uniform happens to contain the pointer depends on how many uniforms have been loaded by the time you get to the QPU_W_TMU[01]_[STRB] writes. Since there's no iommu, I can't trust userspace to tell me where the uniform is, otherwise I'd be allowing them to just lie and put in physical addresses and read arbitrary system memory.

This meant I had to write a shader parser for the kernel, have that spit out a collection of references to texture samples, switch the uniform data from living in BOs in the user -> kernel ABI and instead be passed in as normal system memory that gets copied to the temporary exec bo, and then do relocations on that.

Instead of trying to write this in the kernel, with a ~10 minute turnaround time per test run, I copied my kernel code into Mesa with a little bit of wrapper code to give a kernel-like API environment, and did my development on that. When I'm looking at possibly 100s of iterations to get all the validation code working, it was well worth the day spent to build that infrastructure so that I could get my testing turnaround time down to about 15 sec.

I haven't done actual validation to make sure that the texture samples don't access outside of the bounds of the texture yet (though I at least have the infrastructure necessary now), just like I haven't done that validation for so many other pointers (vertex fetch, tile load/stores, etc.). I also need to copy the code back out to the kernel driver, and it really deserves some cleanups to add sanity to the many different addresses involved (unvalidated vaddr, validated vaddr, and validated paddr of the data for each of render, bin, shader recs, uniforms). But hopefully once I do that, I can soon start bringing up glamor on the Pi (though I've got some major issue with tile allocation BO memory management before anything's stable on the Pi).
July 22, 2014

Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).

Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB for page tables to map one page of system memory. With dynamic page table allocations, this is reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.
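A quick back-of-the-envelope check of those numbers (my arithmetic, assuming 4KB pages, 512-entry tables and the 4 page directories of legacy 32b mode): roughly 8MB if every page table is allocated up front, versus one page directory plus one page table, 8K, when allocating on demand.

4\ \text{PDs} \times \frac{512\ \text{PTs}}{1\ \text{PD}} \times \frac{4096\ \text{bytes}}{1\ \text{PT}} = 8388608\ \text{bytes} \approx 8\text{MB}
(1\ \text{PD} + 1\ \text{PT}) \times 4096\ \text{bytes} = 8192\ \text{bytes} = 8\text{K}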

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature                Status             TODO
Dynamic page tables    Implemented        Test and fix bugs
64b Address space      Implemented        Test and fix bugs
GPU mirroring          Proof of Concept   Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I’ve gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING ioctl.
  4. Do one of aforementioned operations on BOx and BOy
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS, for instance, does not need to know anything about where Scratch Space Base Offset is actually located; it only needs to provide an offset relative to the General State Base Address. The General State Base Address itself does need to be known by userspace, and it is programmed via STATE_BASE_ADDRESS.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which need an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.
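To make that a little more concrete, a single relocation entry handed to the kernel looks roughly like this (the structure is the upstream drm_i915_gem_relocation_entry; the helper wrapped around it and the header path are just illustrative):

#include <stdint.h>
#include <drm/i915_drm.h>   /* struct drm_i915_gem_relocation_entry; include path may vary */

/* Sketch: ask the kernel to patch the dword at 'batch_offset' inside the
 * batch buffer with the final graphics address of 'target_handle'.
 * presumed_offset lets the kernel skip the rewrite if the BO did not move. */
static struct drm_i915_gem_relocation_entry
make_reloc(uint32_t target_handle, uint64_t batch_offset, uint64_t presumed)
{
    struct drm_i915_gem_relocation_entry reloc = {
        .target_handle   = target_handle,
        .delta           = 0,              /* extra offset into the target BO */
        .offset          = batch_offset,   /* where in the batch to patch */
        .presumed_offset = presumed,       /* where userspace thinks the BO is */
        .read_domains    = I915_GEM_DOMAIN_RENDER,
        .write_domain    = 0,
    };

    return reloc;
}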

Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin’ it Happen

64b

As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.
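For the 4-level layout, splitting a canonical graphics address into its per-level indices works just like on the CPU side; a sketch of my own with made-up macro names (the real driver has its own GEN8_* shift and mask definitions):

#include <stdint.h>

/* Illustrative only: 4KB pages and 512 entries per table level. */
#define PAGE_SHIFT  12
#define LEVEL_BITS   9
#define LEVEL_MASK   0x1ff

struct gen8_indices {
    unsigned pml4e, pdpe, pde, pte;
};

static struct gen8_indices split_address(uint64_t address)
{
    struct gen8_indices idx;

    idx.pte   = (address >> PAGE_SHIFT) & LEVEL_MASK;                    /* bits 12..20 */
    idx.pde   = (address >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;     /* bits 21..29 */
    idx.pdpe  = (address >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK; /* bits 30..38 */
    idx.pml4e = (address >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK; /* bits 39..47 */

    /* Per the note above, only 256 of the 512 possible PML4 entries are used. */
    return idx;
}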

“This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have is “done”, then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512² + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4-level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
        return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
        return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                               _PAGE_NUMA);
}
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
}
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

X86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don’t know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

  • PML4[0-255], each point to a PDP
  • PDP[0-255][0-511], each point to a PD
  • PD[0-255][0-511][0-511], each point to a PT
  • PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
  2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0 is free.
    2. [i915]Allocate PDP for PML4[0]
    3. [i915]Allocate PD for PDP[0][0]
    4. [i915]Allocate PT for PD[0][0][0]
    5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
    6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing page.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0x200000 is free.
    2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Tables Tracking with Bitmaps

Okay, I could have used a sentinel for empty entries: point the page table entry to the scratch page. Implementing this involves reading back potentially large amounts of data from the page tables, which will be slow. It should work though. I didn’t try it.

After I had determined I couldn’t reuse x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately, I then did some math – notice the LaTeX!
\frac{2^{47}\ \text{bytes}}{4096\ \text{bytes per page}} = 34359738368\ \text{pages}
34359738368\ \text{pages} \times \frac{1\ \text{bit}}{1\ \text{page}} = 34359738368\ \text{bits}
\frac{34359738368\ \text{bits}}{8\ \text{bits per byte}} = 4294967296\ \text{bytes}
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
  256\ \text{entries} + (256 \times 512)\ \text{entries} + (256 \times 512^2)\ \text{entries} = 67240192\ \text{entries}
  67240192\ \text{entries} \times \frac{1\ \text{bit}}{1\ \text{entry}} = 67240192\ \text{bits}
  \frac{67240192\ \text{bits}}{8\ \text{bits per byte}} = 8405024\ \text{bytes}
  4294967296\ \text{bytes} + 8405024\ \text{bytes} = 4303372320\ \text{bytes}
  4303372320\ \text{bytes} \times \frac{1\ \text{GB}}{1073741824\ \text{bytes}} \approx 4.008\ \text{GB}

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or whether I was so caught up in the details that I couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is roughly 0.003% of the memory used by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
	...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
	...
}

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
	...
}

/**
 * alloc_page_tables - Allocate all page tables for the given virtual address.
 *
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 *
 * @ppgtt:	The PPGTT structure encapsulating the virtual address space.
 * @address:	The virtual address for which we want page tables.
 *
 */
static void
alloc_page_tables(ppgtt, unsigned long address)
{
	struct i915_pagetab *pt;
	struct i915_pagedir *pd;
	struct i915_pagedirpo *pdp;
	struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

	int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
	int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
	int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
	int pte = (address & I915_PDES_PER_PD);

	if (!test_bit(pml4e, pml4->used_pml4es))
		goto alloc_pdp;

	pdp = pml4->pagedirpo[pml4e];
	if (!test_bit(pdpe, pdp->used_pdpes))
		goto alloc_pd;

	pd = pdp->pagedirs[pdpe];
	if (!test_bit(pde, pd->used_pdes))
		goto alloc_pt;

	pt = pd->page_tables[pde];
	if (test_bit(pte, pt->used_ptes))
		return;

	/* Every level already exists; only the PTE itself remains to be
	 * pointed at a page, which happens elsewhere. */
	return;

alloc_pdp:
	pdp = alloc_one_pdp(pml4, pml4e);
	set_bit(pml4e, pml4->used_pml4es);
alloc_pd:
	pd = alloc_one_pd(pdp, pdpe);
	set_bit(pdpe, pdp->used_pdpes);
alloc_pt:
	pt = alloc_one_pt(pd, pde);
	set_bit(pde, pd->used_pdes);
}

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
	__u64 user_ptr;
	__u64 user_size;
	__u32 ctx_id;
	__u32 flags;
#define I915_USERPTR_READ_ONLY          (1<<0)
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
	/**
	 * Returned handle for the object.
	 *
	 * Object handles are nonzero.
	 */
	__u32 handle;
	__u32 pad;
};

The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.
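Putting it together, using the proposed interface would look something like the sketch below. Note that this is based on the modified struct shown above from the unmerged series: the ctx_id field and I915_USERPTR_GPU_MIRROR flag do not exist in the upstream ioctl, and the header carrying them here is purely hypothetical.

#include <stdint.h>
#include <xf86drm.h>            /* drmIoctl() */
#include "i915_drm_mirror.h"    /* hypothetical: the modified drm_i915_gem_userptr above */

/* Sketch: mirror an existing CPU allocation into the GPU address space of
 * context 'ctx_id' at the same virtual address the CPU already uses. */
static uint32_t mirror_buffer(int fd, uint32_t ctx_id, void *ptr, uint64_t size)
{
    struct drm_i915_gem_userptr arg = {
        .user_ptr  = (uintptr_t)ptr,     /* doubles as the requested GPU address */
        .user_size = size,
        .ctx_id    = ctx_id,
        .flags     = I915_USERPTR_GPU_MIRROR,
    };

    if (drmIoctl(fd, DRM_IOCTL_I915_GEM_USERPTR, &arg))
        return 0;                        /* handles are nonzero, so 0 means failure */

    return arg.handle;
}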

  • Pros:
    • This interface is very simple.
    • Existing userptr code does the hard work for us
  • Cons:
    • You need 1 IOCTL per object. Much unneeded overhead.
    • It’s subject to a lot of problems userptr has5
    • Userptr was already merged, so unless pad gets repurposed, we’re screwed

What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trend of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called “soft pin.” It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can’t be merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address spaces. Hopefully you remember what you just read about 64b and GPU mirroring. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

Image links

The images I’ve created. Feel free to do with them as you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/legacy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/mirrored.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/table_hierarchy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/addr-bitmap.svg

Download PDF

  1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info, which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)  

  3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

  4. I am not sure how KVM manages page tables. At least conceptually I’d think they’d have a similar problem to the i915 driver’s page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

  5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring 

July 21, 2014

Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fall backs aren’t terrible, this isn’t as useful as it once was, and because this uses Glamor in a weird way, we’re making the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding “_uxa” to their names. A couple dozen sed runs and now a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new ‘intel_uxa.h’ file was created to hold all of the definitions.

Finally, a few non UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
    int ok = 0;

    if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
        ok = glamor_fill_spans_nf(pDrawable,
                      pGC, n, ppt, pwidth, fSorted);
        uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
    }

    if (!ok)
        goto fallback;

    return;
}

Pulling those out shrank the UXA code by quite a bit.

Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an ‘accel’ variable which could be one of three options:

  • ACCEL_GLAMOR
  • ACCEL_UXA
  • ACCEL_NONE

I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3 so that we can bring up Mesa and run it under X before we have any acceleration code ready; avoiding a dependency loop when doing new hardware. All that it requires is a kernel that offers mode setting and buffer allocation.

Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we’ve got lots of places that look like:

        switch (intel->accel) {
#if USE_GLAMOR
        case ACCEL_GLAMOR:
                if (!intel_glamor_create_screen_resources(screen))
                        return FALSE;
                break;
#endif
#if USE_UXA
        case ACCEL_UXA:
                if (!intel_uxa_create_screen_resources(screen))
                        return FALSE;
        break;
#endif
        case ACCEL_NONE:
                if (!intel_none_create_screen_resources(screen))
                        return FALSE;
                break;
        }

Using a switch means that we can easily elide code that isn’t wanted in a particular build. Of course ‘accel’ is an enum, so places which are missing one of the required paths will cause a compiler warning.

It’s not all perfectly clean yet; there are piles of UXA-only paths still.

Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef’s in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231

The ‘legacy’ addition supports i810-class hardware, which is needed for a complete driver.

Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling (!) because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven’t analyzed what effect this has on performance.

Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It’s generally the fastest way of updating the screen as you don’t have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that’s pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

fd = glamor_fd_from_pixmap(screen, pixmap, &stride, &size);

bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
close(fd);
intel_glamor_get_pixmap(pixmap)->bo = bo;

That last bit remembers the bo in some local memory so we don’t have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It’s all quite round-about, but it does seem to work just fine.

After I’d gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn’t working. That turned out to have an unexpected consequence — all full-screen applications would run flat-out, and not be limited to frame rate. Present ‘recovers’ from a failed flip queue operation by immediately performing a CopyArea; not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued and forcing them to fall back to CopyArea at the right time.

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

Painting Root before Mode set

The X server has generally done initialization in one order:

  1. Create root pixmap
  2. Set video modes
  3. Paint root window

Recently, we’ve added a ‘-background none’ option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can’t use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

  1. Create root pixmap
  2. Paint root window
  3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we’re not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy — just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge — I needed to tell the DIX level RandR functions about the new modes; the mode setting operation used during server init doesn’t call up into RandR as RandR lists the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.

Performance

I ran the current glamor version of the intel driver with the master branch of the X server and there were not any huge differences since my last Glamor performance evaluation aside from GetImage. The reason is that UXA/Glamor never called Glamor’s image functions, and the UXA GetImage is pretty slow. Using Mesa’s image download turns out to have a huge performance benefit:

1. UXA/Glamor from April
2. Glamor from today

       1                 2                 Operation
------------   -------------------------   -------------------------
     50700.0        56300.0 (     1.110)   ShmGetImage 10x10 square 
     12600.0        26200.0 (     2.079)   ShmGetImage 100x100 square 
      1840.0         4250.0 (     2.310)   ShmGetImage 500x500 square 
      3290.0          202.0 (     0.061)   ShmGetImage XY 10x10 square 
        36.5          170.0 (     4.658)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     49800.0        50200.0 (     1.008)   GetImage 10x10 square 
      5690.0        19300.0 (     3.392)   GetImage 100x100 square 
       609.0         1360.0 (     2.233)   GetImage 500x500 square 
      3100.0          206.0 (     0.066)   GetImage XY 10x10 square 
        36.4          183.0 (     5.027)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Running UXA from today, the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before?

1: UXA today
2: Glamor today

       1                 2                 Operation
------------   -------------------------   -------------------------
     43200.0        56300.0 (     1.303)   ShmGetImage 10x10 square 
      2600.0        26200.0 (    10.077)   ShmGetImage 100x100 square 
       130.0         4250.0 (    32.692)   ShmGetImage 500x500 square 
      3260.0          202.0 (     0.062)   ShmGetImage XY 10x10 square 
        36.7          170.0 (     4.632)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     41700.0        50200.0 (     1.204)   GetImage 10x10 square 
      2520.0        19300.0 (     7.659)   GetImage 100x100 square 
       125.0         1360.0 (    10.880)   GetImage 500x500 square 
      3150.0          206.0 (     0.065)   GetImage XY 10x10 square 
        36.1          183.0 (     5.069)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Of course, this is all just x11perf, which doesn’t represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it’s nice to have this kind of speed up.

Status

I’m running this on my crash box to get some performance numbers and continue testing it. I’ll switch my desktop over when I feel a bit more comfortable with how it’s working. But, I think it’s feature complete at this point.

Where’s the Code

As usual, the code is in my personal repository. It’s on the ‘glamor’ branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor
July 19, 2014

Hello,

As part of my Google Summer of Code project I implemented MP counters (for compute only) on nv50/tesla. This work follows the implementation of MP counters for nvc0/fermi I did the last year.

Compute counters are used by OpenCL while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction between these two types of counters made by NVIDIA is arbitrary and won’t be present in my implementation. That’s why compute counters can also be used to give detailed information about OpenGL applications, like the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context, while performance counters programmed through the PCOUNTER engine are global. An MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately, while a global counter reports activities regardless of the context that generates them.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface which only exposes compute counters. On nv50/tesla, CUPTI exposes 13 performance counters like instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure an MP counter we use the command stream like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one just reinitializes the counter. Then, to select the group of the MP counter, we have added a software method. To poll counters we use a notifier buffer object which is allocated along with a channel. This notifier allows the kernel and mesa to communicate. This approach has already been explained in my previous article.

To sum up, this prototype adds support for 13 performance counters on nv50/tesla. All of the code is available on my github account. If you are interested, you can take a look at the mesa and the nouveau code.

Have a good day.


July 17, 2014

Two years ago, I got appointed as chairman of the openSUSE Board. I was very excited about this opportunity, especially as it allowed me to keep contributing to openSUSE after having moved to work on the cloud a few months before. I remember how I wanted to find new ways to participate in the project, and this was just a fantastic match. I had been on the GNOME Foundation board for a long time, so I knew it would not always be easy and fun, but I also knew I would pretty much enjoy it. And I did.

Fast-forward to today: I'm still deeply caring about the project and I'm still excited about what we do in the openSUSE board. However, some happy event to come in a couple of months means that I'll have much less time to dedicate to openSUSE (and other projects). Therefore I decided a couple of months ago that I would step down before the end of the summer, after we'd have prepared the plan for the transition. Not an easy decision, but the right one, I feel.

And here we are now, with the official news out: I'm no longer the chairman :-) (See also this thread.) Of course I'll still stay around and contribute to openSUSE, no worries about that! But as mentioned above, I'll have less time for that as offline life will be more "busy".

openSUSE Board Chairman at oSC14

Since I mentioned that we were working on a transition... First, knowing the current board, I have no doubt everything will be kept pushed in the right direction. But on top of that, my good friend Richard Brown has been appointed as the new chairman. Richard knows the project pretty well and he has been on the board for some time now, so is aware of everything that's going on. I've been able to watch his passion for the project, and that's why I'm 100% confident that he will rock!

Anandtech recently went all out on the ARM midgard architecture (Mali T series). This was quite astounding, as ARM MPD tends to be a pretty closed shop. The Anandtech coverage included an in-depth view of the Mali Midgard GPU, a (short) Q&A session with Jem Davies (the head honcho of ARM MPD, ARM's Media Processing Division, the part of ARM that develops the Mali and the display and video engines) and a google hangout with Jem Davies a week later.

This set of articles does not seem like the sort of thing that ARM MPD would have initiated itself. Since both Imagination Technologies and NVidia did something similar months earlier, my feeling is that this was either initiated by Anand Lal Shimpi himself, or that this was requested by ARM marketing in response to the other articles.

Several interesting observations can be made from this though, especially from the answers (or sometimes, lack thereof) to the Q&A and google hangout sessions.

Hiding behind Linaro.

First off, Mr Davies still does not see an open source driver as a worthwhile endeavour for ARM MPD, and this is a position that hasn't changed since I started the lima driver, when my former employer went and talked to ARM management. Rumour has it that most of ARM's engineers, both in MPD and other departments, would like this to be different, and that Mr Davies is mostly alone with his views, but that's currently just hearsay. He himself states that there are only business reasons against an open source driver for the Mali.

To give some weight to this, Mr Davies stated that he contributed to the linux kernel, and I called him out on that one, as I couldn't find any mention of him in a kernel git tree. It seems however that his contributions are from the Bitkeeper days, and that the author trail on those changes probably got lost. But having contributed to a project at one point or another is, to me, not proof that one actively supports the idea of open source software; at best it proves that adding support to the kernel for a given ARM device or subsystem was simply necessary at one point.

Mr Davies also talked about how ARM is investing a lot in linaro, as proof of ARM's support of open source software. Linaro is a consortium to further linux on ARM, so by definition ARM plays a very big role in it. But it is not ARM MPD that drives linaro, it is ARM itself. So this is not proof of ARM MPD actively supporting open source software. Mr Davies did not claim differently, but this distinction should be made very clear in this context.

Then, linaro can be described as an industry consortium. For non-founding members of a consortium, such a construction is often used to park some less useful people while gaining the privilege to claim involvement as and when desired. The difference to other consortiums is that most of the members come from a deeply embedded background, where the word "open" was never spoken before, and, magically, simply by having joined linaro, those deeply embedded companies now feel like they successfully ticked the "open source" box on their marketing checklist. Several of linaro's members are still having severe difficulty conforming to the GPL, but they still proudly wear the linaro badge as proof of their open source...ness?

As a prominent member of the sunxi community, I am most familiar with Allwinner, a small Chinese designer of cheap SoCs. At the start of the year, we were seeing some solid signs of Allwinner opening up to our community directly. In March however, Allwinner joined linaro and people were hopeful that this meant that a new era of openness had started for Allwinner. As usual, I was the only cynical voice and I warned that this could mean that Allwinner now wouldn't see the need to further engage with us. Ever since, we haven't been able to reach our contacts inside Allwinner anymore, and even our requests for compliance with the GPL get ignored.

Linaro membership does not absolve a company of limited open source involvement or downright license violations, but for many members, this is exactly how it is used. Linaro seems to be a get-out-of-jail-free card for several of its members. Linaro membership does not need to prove anything; linaro membership even seems to have the opposite effect in several cases.

ARM driving linaro is simply no proof that ARM MPD supports open source software.

The patent excuse.

I am amazed that people still attempt to use this as an argument against open source graphics drivers.

Usually this is combined with the claim that open source drivers expose too much of the inner workings of the hardware. But this logic in itself states that the hardware is the problem, not the software. The hardware itself might or might not have patent issues, and it is just a matter of time before the owner of said infringed patents comes a-knocking. At best, an open source driver might speed up the discovery of said issues, but the driver itself never is the cause, as the problems will have been there all along.

One would actually think that the Anandtech article about the midgard architecture would reveal more about the hardware, and trigger more litigation, than the lima driver could ever do, especially given how neatly packaged an in depth anandtech article is. Yet ARM MPD seemed to have had no issue with exposing this much information in their marketing blitz.

I also do not believe that patents are such a big issue. If graphics hardware patents were such big business, you would expect that an industry expert in graphics, especially one who is a dab hand at reverse engineering, would be contacted all the time to help expose potential patent issues. Yet I never have been contacted, and I know of no-one who ever has been.

Similarly, the first bits of lima code were made available 2.5 years ago, with bits trickling out slowly (much to my regret), and there are still several unknowns today. If lima played any role in patent disputes, you would again expect that I would be asked to support those looking to assert their patents. Again, nothing.

GPU Patents are just an excuse, nothing more.

When I was at SuSE, we freed ATI for AMD, and we never did hear that excuse. AMD wanted a solid open source strategy for ATI as ATI was not playing ball after the merger, and the bad publicity was hurting server (CPU) sales. Once the decision was made to go the open source route, patents suddenly were not an issue anymore. We did however have to deal with IP issues (or actually, AMD did - we made very sure we didn't get anything that wasn't supposed to be free), such as HDCP and media decoding, which ATI was not at liberty to make public. Given the very heated war that ATI and Nvidia fought at the time, and the huge amount of revenue in this market, you would think that ATI would be a very likely candidate for patent litigation, yet this never stood in the way of an open source driver.

There is another reason as to why patents are that popular an excuse. The words "troll" and "legal wrangling" are often sprinkled around as well, so that images of shady deals being made by lawyers in smoky backrooms usually come to mind. Yet we never get to hear the details of patent cases, as even Mr Davies himself states that ARM is not making details available of ongoing cases. I also do not know of any public details on cases that have been closed already (not that I have actively looked - feel free to enlighten me). Patents are a perfect blanket excuse where proof apparently does not seem to be required.

We open source developers are very much aware of the damage that software patents do, and this makes the patent weapon perfect for deployment against those who support open source software. But there is a difference between software patents and the patent cases that ARM potentially has to deal with on the Mali. Yet we seem to have made patents our own kryptonite, and are way too easily lulled into backing off at the first mention of the word patent.

Patents are a poor excuse, as there is no direct relationship between an open source driver and the patent litigation around the hardware.

The Resources discussion.

For a hardware vendor (or IP provider), doing a free software driver is never free. A lot of developer time needs to be invested, and this is an ongoing commitment. So yes, a viable open source driver for the Mali will consume some amount of resources.

Mr Davies states that MPD would have to incur this cost on its own, as MPD seems to be a completely separate unit and that further investment can only come from profit made within this group. In light of that information, I must apologize for ever having treated ARM and ARM MPD as one and the same with respect to this topic. I will from now on make it very clear that it is ARM MPD, and ARM MPD alone, that doesn't want an open source mali driver.

I do believe that Mr Davies' cost versus gain calculations are too direct and do not allow for secondary effects.

I also believe that an ongoing refusal to support an open source strategy for the Mali will reflect badly on the sale of ARM processors and other IP, especially with ARM now pushing into the server market and getting into intel territory. The actions of ARM MPD do affect ARM itself, and vice versa. Admittedly, not as much as with those vendors that more closely tie the in-house GPU to the rest of the system, but that's far from an absolute lack of shared dependency and responsibility.

The Mali binary problem.

One person in the Q&A section asked why ARM isn't doing redistributable drivers like Nvidia does for the Tegra. Mr Davies answered that this was a good idea, and that linaro was doing something along those lines.

Today, ironically, I am the canonical source for mali-400 binaries. At the sunxi project, we got some binaries from the Cubietech people, built from code they received from Allwinner, and the legal terms they were under did not prevent them from releasing the built binaries to the public. Around them (or at least, using the binaries as a separate git module) I built a small make based installation system which integrates with ARM's open source memory manager (UMP) and even included a quick GLES test from the lima tests. I stopped just short of debian packaging. The sunxi-mali repository, and the wiki tutorial that goes with it, is now used by many other projects (like for instance linux-rockchip) as their canonical source for (halfway usable) GPU support.

There are several severe problems with these binaries, which we have either fixed directly, have been working around or just have to live with. Direct fixes include adding missing library dependencies, and hollowing out a destructor function which made X complain. These are binary hacks. The xf86-video-fbturbo driver from Siarhei Siamashka works around the broken DRI2 buffer management, but it has to try to autodetect how to work around the issues, as it is differently broken on the different versions of the X11 binaries we have. Then there is the flaky coverage, as we only have binaries for a handful of kernel APIs, making it impossible to match them against all vendor provided SoC/device kernels. We also only have binaries for fbdev or X11, and sometimes for android, mostly for armhf, but not always... It's just one big mess, only slightly better than having nothing at all.

Much to our surprise, in October of last year, ARM MPD published a howto entry about setting up a working driver for mali midgard on the chromebook. It was a step in the right direction, but it involved quite a bit of faff, and Connor Abbott (the brilliant teenager REing the mali shaders) had to go and pour things into a proper git repository so that it would be more immediately useful. Another bout of insane irony, as this laudable step in the right direction by ARM MPD ultimately left something to be desired.

ARM MPD is not like ATI, Nvidia, or even intel, qualcomm or broadcom. The Mali is built into many very different SoC families, and needs to be integrated with different display engines, 2D engines, media engines and memory/cache subsystems.

Even the distribution of drivers is different. From what I understand, mali drivers are handled as follows. The Mali licensees get the relevant and/or latest mali driver source code and access to some support from ARM MPD. The device makers, however, only rarely get their hands on source code themselves and usually have to make do with the binaries provided by the SoC vendor. Similarly, the device maker only rarely gets to deal with ARM MPD directly, and usually needs to deal with some proxy at the SoC vendor. This setup puts the responsibility of SoC integration squarely at the SoC vendor, and is well suited for the current mobile market: one image per device at release, and then almost no updates. But that market is changing with the likes of Cyanogenmod, and other markets are opening or are actively being opened by ARM, and those require a completely different mode of operation.

There is a gap in Mali driver support that ARM MPD's model of driver delivery does not cater for today, and ARM MPD knows about this. But MPD is going to be fighting an uphill battle to try to correct this properly.

Binary solutions?

So how can ARM MPD try to tackle this problem?

Would ARM MPD keep the burden of making suitable binaries available solely with SoC vendors or device makers? Not likely, as that is a pretty shaky affair that's actively hurting the mali ecosystem. SoCs for the mobile market have incredibly short lives, and SoC and device software support is so fragmented that these vendors would be responsible for backporting bugfixes to a very wide array of kernels and SoC versions. On top of that, those vendors would only support a limited subset of windowing systems, possibly even only android as this is their primary market. Then, they would have to set up the support infrastructure to appropriately deal with user queries and bug reports. Only very few vendors will end up even attempting to do this, and none are doing so today. In the end, any improvement at this end will bring no advantages to the mali brand or ARM MPD. If this path is kept, we will not move on from the abysmal situation we are in today, and the Mali will continue to be seen as a very fragmented product.

ARM MPD has little other option but to try to tackle this itself, directly, and it should do so more proactively than by hiding behind linaro. Unfortunately, to make any real headway here, this means providing binaries for every kernel driver interface, and the SoC vendor changes to those interfaces, on top of other bits of SoC specific integration. But this also means dealing with user support directly, and these users will of course spend half their time asking questions which should be aimed at the SoC vendor. How is ARM MPD going to convince SoC vendors to participate here? Or is ARM MPD going to maintain most of the SoC integration work themselves? Surely it will not keep the burden only at linaro, wasting the resources of the rest of ARM and of linaro partners?

ARM MPD just is in a totally different position than the ATIs and Nvidias of this world. Providing binaries that will satisfy a sufficient part of the need is going to be a huge drain on resources. Sure, MPD is not spending the same amount of resources on optimizing for specific setups and specific games like ATI or Nvidia are doing, but they will instead have to spend it on the different SoCs and devices out there. And that's before we start talking about different windowing infrastructure, beyond surfaceflinger, fbdev or X11. Think wayland, mir, even directFB, or any other special requirements that people tend to have for their embedded hardware.

At best, ARM MPD itself will manage to support surfaceflinger, fbdev and X11 on just a handful of popular devices. But how will ARM MPD know beforehand which devices are going to be popular? How will ARM MPD keep on making sure that the binaries match the available vendor or device kernel trees? Would MPD take the insane route of maintaining their own kernel repositories with a suitable mali kernel driver for those few chosen devices, and backporting changes from the real vendor trees instead? No way.

Attempting to solve this very MPD specific problem with only binaries, to any degree of success, is going to be a huge drain on MPD resources, and in the end, people will still not be satisfied. The problem will remain.

The only fitting solution is an open source driver. Of course, the Samsungs of this world will not ship their flagship phones with just an open source GPU driver in the next few years. But an open source driver will fundamentally solve the issues people currently have with Mali, the issues which fuel both the demand for fitting distributable binaries and for an open source driver. Only an open source driver can be flexible and cost-effective enough to fill that gap. Only an open source driver can get silicon vendors, device makers, solution creators and users chipping in, satisfying their own, very varied, needs.

Change is coming.

The ARM world is rapidly changing. Hardware review sites, which used to only review PC hardware, are more and more taking notice of what is happening in the mobile space. Companies that are still mostly stuck in embedded thinking are having to more and more act like PC hardware makers. The lack of sufficiently broad driver support is becoming a real issue, and one that cannot be solved easily or cheaply with a quick binary fix, especially for those who sell no silicon of their own.

The Mali marketing show on Anandtech tells us that things are looking up. The market is forcing ARM MPD to be more open, and MPD has to either sink or swim. The next step was demonstrated by yours truly and some other very enterprising individuals, and now both Nvidia and Broadcom are going all the way. It is just a matter of time before ARM MPD has to follow, as they need this more than their more progressive competitors.

To finish off, at the end of the Q&A session, someone asked: "Would free drivers give greater value to the shareholders of ARM?". After a quick braindump, I concluded "Does ARM's lack of free drivers hurt shareholder value?" But we really should be asking "To what extent does ARM's lack of free drivers hurt shareholder value?".
July 16, 2014

Today I am very happy to announce the release of AppStream 0.7, the second-largest release (judging by commit number) after 0.6. AppStream 0.7 brings many new features for the specification, adds lots of good stuff to libappstream, introduces a new libappstream-qt library for Qt developers and, as always, fixes some bugs.

Unfortunately we broke the API/ABI of libappstream, so please adjust your code accordingly. Apart from that, any other changes are backwards-compatible. So, here is an overview of what’s new in AppStream 0.7:

Specification changes

Distributors may now specify a new <languages/> tag in their distribution XML, providing information about the languages a component supports and the completion-percentage for the language. This allows software-centers to apply smart filtering on applications to highlight the ones which are available in the user's native language.

A new addon component type was added to represent software which is designed to be used together with a specific other application (think of a Firefox addon or GNOME-Shell extension). Software-center applications can group the addons together with their main application to provide an easy way for users to install additional functionality for existing applications.

The <provides/> tag gained a new dbus item-type to expose D-Bus interface names the component provides to the outside world. This means in future it will be possible to search for components providing a specific dbus service:

$ appstream-index what-provides dbus org.freedesktop.PackageKit.desktop system

(if you are using the cli tool)

A <developer_name/> tag was added to the generic component definition to define the name of the component developer in a human-readable form. Possible values are, for example “The KDE Community”, “GNOME Developers” or even the developer’s full name. This value can be (optionally) translated and will be displayed in software-centers.

An <update_contact/> tag was added to the specification, to provide a convenient way for distributors to reach upstream to talk about changes made to their metadata or issues with the latest software update. This tag was already used by some projects before, and has now been added to the official specification.

Timestamps in <release/> tags must now be UNIX epochs, YYYYMMDD is no longer valid (fortunately, everyone is already using UNIX epochs).

Last but not least, the <pkgname/> tag is now allowed multiple times per component. We still recommend creating metapackages according to the contents the upstream metadata describes and placing the file there. However, in some cases defining one component to be in multiple packages is a short way to make metadata available correctly without excessive package-tuning (which can become difficult if a <provides/> tag needs to be satisfied).

As a small sidenote: the multiarch path in /usr/share/appdata is now deprecated, because we think that we can live without it (by shipping -data packages per library and using smarter AppStream metadata generators which take advantage of the ability to define multiple <pkgname/> tags).

Documentation updates

In general, the documentation of the specification has been reworked to be easier to understand and to include less duplication of information. We now use extensive crosslinking to show you the information you need in order to write metadata for your upstream project, or to implement a metadata generator for your distribution.

Because the specification needs to define the allowed tags completely and contain as much information as possible, it is not very easy to digest for upstream authors who just want some metadata shipped quickly. In order to help them, we now have “Quickstart pages” in the documentation, which are rich in examples and contain the most important subset of information you need to write a good metadata file. These quickstart guides already exist for desktop-applications and addons; more will follow in the future.

We also have an explicit section dealing with the question “How do I translate upstream metadata?” now.

More changes to the docs are planned for the next point releases. You can find the full project documentation at Freedesktop.

AppStream GObject library and tools

The libappstream library also received lots of changes. The most important one: we switched from LGPL-3+ to LGPL-2.1+. People who know me know that I love the v3 family of GPL licenses – I like it for the tivoization protection, its explicit compatibility with some important other licenses and cosmetic details, like entities not losing their right to use the software forever after a license violation. However, an LGPL-3+ library does not mix well with projects licensed under other open source licenses, mainly GPL-2-only projects. I want libappstream to be used by anyone without forcing the project to change its license. For some reason, using the library from proprietary code is easier than using it from a GPL-2-only open source project. The license change was also a popular request of people wanting to use the library, so I made the switch with 0.7. If you want to know more about the LGPL-3 issues, I recommend reading this blogpost by Nikos (GnuTLS).

On the code-side, libappstream received a large pile of bugfixes and some internal restructuring. This makes the cache builder about 5% faster (depending on your system and the amount of metadata which needs to be processed) and prepares for future changes (e.g. I plan to obsolete PackageKit’s desktop-file-database in the long term).

The library also brings back support for legacy AppData files, which it can now read. However, appstream-validate will not validate these files (and kindly ask you to migrate to the new format).

The appstream-index tool received some changes, making its command-line interface a bit more modern. It is also possible now to place the Xapian cache at arbitrary locations, which is a nice feature for developers.

Additionally, the testsuite got improved and should now work on systems which do not have metadata installed.

Of course, libappstream also implements all features of the new 0.7 specification.

With the 0.7 release, some symbols were removed which have been deprecated for a few releases, most notably as_component_get/set_idname, as_database_find_components_by_str, as_component_get/set_homepage and the “pkgname” property of AsComponent (which is now a string array and called “pkgnames”). API level was bumped to 1.

Appstream-Qt

A Qt library to access AppStream data has been added. So if you want to use AppStream metadata in your Qt application, you can easily do that now without touching any GLib/GObject based code!

Special thanks to Sune Vuorela for his nice rework of the Qt library!

And that’s it with the changes for now! Thanks to everyone who helped make 0.7 ready, be it feedback, contributions to the documentation, translation or coding. You can get the release tarballs at Freedesktop. Have fun!

July 14, 2014

Following Christian's Wayland in Fedora Update post, and after Hans fixed the touchpad acceleration, I've been playing with pointer acceleration in libinput a bit. The main focus was not yet on changing it but rather on figuring out what we actually do and where the room for improvement is. There's a tool in my (rather messy) github wip/ptraccel-work branch to re-generate the graphs below.

This was triggered by a simple plan: I want a configuration interface in libinput that provides a sliding scale from -1 to 1 to adjust a device's virtual speed from slowest to fastest, with 0 being the default for that device. A user should not have to worry about the accel mechanism itself, which may be different for any given device, all they need to know is that the setting -0.5 means "halfway between default and 'holy cow this moves like molasses!'". The utopia is of course that for any given acceleration setting, every device feels equally fast (or slow). In order to do that, I needed the right knobs to tweak.

The code we currently have in libinput is pretty much 1:1 what's used in the X server. The X server sports a lot more configuration options, but what we have in libinput 0.4.0 is essentially what the default acceleration settings are in X. Armed with the knowledge that any #define is a potential knob for configuration, I went to investigate. There are two defines that are labelled as adjustable parameters:

  • DEFAULT_THRESHOLD, set to 0.4
  • DEFAULT_ACCELERATION, set to 2.0
But what do they mean, exactly? And what exactly does a value of 0.4 represent?
[side-note: threshold was 4 until I took the constant multiplier out, it's now 0.4 upstream and all the graphs represent that.]

Pointer acceleration is nothing more than mapping some input data to some potentially faster output data. How much faster depends on how fast the device moves, and to get there one usually needs a couple of steps. The trick of course is to make it predictable, so that despite the acceleration, your brain thinks that the visible cursor is an extension of your hand at all speeds.

Let's look at a high-level outline of our pointer acceleration code:

  • calculate the velocity of the current movement
  • use that velocity to calculate the acceleration factor
  • apply accel to dx/dy
  • smoothen out the dx/dy to avoid abrupt changes between two events

Calculating pointer speed

We don't just use dx/dy as values, rather, we use the pointer velocity. There's a simple reason for that: dx/dy depends on the device's poll rate (or interrupt frequency). A device that polls twice as often sends half the dx/dy values in each event for the same physical speed.

Calculating the velocity is easy: divide dx/dy by the delta time. We use a set of "trackers" that store previous dx/dy values with their timestamp. As long as we get movement in the same cardinal direction, we take those into account. So if we have 5 events in direction NE, the speed is averaged over those 5 events, smoothing out abrupt speed changes.
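To make that concrete, here's a minimal sketch of the idea in C. This is not the actual libinput code; the tracker layout and names are made up for illustration, but the arithmetic is the same: sum the deltas of the trackers that share the newest event's direction and divide by the time they span.

#include <stdint.h>
#include <math.h>

/* Hypothetical tracker entry: one recorded motion event. */
struct tracker {
    double dx, dy;      /* delta in device units */
    uint64_t time_ms;   /* timestamp in milliseconds */
    int dir;            /* cardinal direction of the delta (N, NE, E, ...) */
};

/* t[0] is the newest event, higher indices are older events.
 * Returns the average velocity in device units per millisecond over the
 * stretch of events that moved in the same direction as the newest one. */
static double
velocity_from_trackers(const struct tracker *t, int ntrackers)
{
    double dist = 0.0;
    uint64_t dt;
    int i;

    /* walk back in time while the direction matches the newest event */
    for (i = 1; i < ntrackers && t[i].dir == t[0].dir; i++)
        dist += hypot(t[i - 1].dx, t[i - 1].dy);

    dt = t[0].time_ms - t[i - 1].time_ms;
    if (dt == 0)
        return 0.0;

    return dist / (double) dt;
}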

The acceleration function

The speed we just calculated is passed to the acceleration function to calculate an acceleration factor.

Figure 1: Mapping of velocity in unit/ms to acceleration factor (unitless). X axes here are labelled in units/ms and mm/s.
This function is the only place where DEFAULT_THRESHOLD/DEFAULT_ACCELERATION are used, but they mostly just stretch the graph. The shape stays the same.

The output of this function is a unit-less acceleration factor that is applied to dx/dy. A factor of 1 means leaving dx/dy untouched, 0.5 is half-speed, 2 is double-speed.

Let's look at the graph for the accel factor output (red): for very slow speeds we have an acceleration factor < 1.0, i.e. we're slowing things down. There is a distinct plateau up to the threshold of 0.4, after that it shoots up to roughly a factor of 1.6 where it flattens out a bit until we hit the max acceleration factor.

Now we can also put units to the two defaults: Threshold is clearly in units/ms, and the acceleration factor is simply a maximum. Whether those are mentally easy to map is a different question.
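To make the shape concrete, here's a deliberately simplified profile in the same spirit. This is not the function libinput actually uses, just an illustration of the two knobs: below the threshold we decelerate slightly, above it the factor ramps up and is capped at the maximum.

/* Simplified illustration only, not the real libinput profile.
 * velocity is in units/ms; threshold and max_accel correspond to
 * DEFAULT_THRESHOLD (0.4) and DEFAULT_ACCELERATION (2.0). */
static double
simple_accel_profile(double velocity, double threshold, double max_accel)
{
    double factor;

    if (velocity < threshold)
        return 0.8;             /* very slow movement gets slowed down */

    /* ramp up from 1.0 at the threshold, capped at the maximum factor */
    factor = 1.0 + (velocity - threshold);
    return factor < max_accel ? factor : max_accel;
}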

We don't use the output of the function as-is, rather we smooth it out using Simpson's rule. The second (green) curve shows the accel factor after the smoothing took effect. This is a contrived example, the tool to generate this data simply increased the velocity, hence this particular line. For more random data, see Figure 2.

Figure 2: Mapping of velocity in unit/ms to acceleration factor (unitless) for a random data set. X axes here are labelled in units/ms and mm/s.
For the data set, I recorded the velocity from libinput while using Firefox a bit.

The smoothing takes history into account, so the data points we get depend on the usage. In this data set (and others I tested) we see that the majority of the points still lie on or close to the pure function, apparently the delta doesn't matter that much. Nonetheless, there are a few points that suggest that the smoothing does take effect in some cases.

It's important to note that this is already the second smoothing to take effect - remember that the velocity (may) average over multiple events and thus smoothens the input data. However, the two smoothing effects somewhat complement each other: velocity smoothing only happens when the pointer moves consistently without much change, the Simpson's smoothing effect is most pronounced when the pointer moves erratically.
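For reference, that second smoothing step amounts to something like the following, building on the simple_accel_profile() sketch above. The exact weighting in the real code may differ; the idea is to average the profile over the velocity change instead of jumping straight to the new factor.

/* Average the profile between the previous and the current velocity using
 * Simpson's rule, so a sudden velocity change doesn't translate into a
 * sudden jump of the acceleration factor. Sketch only. */
static double
smoothed_accel_factor(double last_velocity, double velocity)
{
    double mid = (last_velocity + velocity) / 2.0;

    return (simple_accel_profile(last_velocity, 0.4, 2.0) +
            4.0 * simple_accel_profile(mid, 0.4, 2.0) +
            simple_accel_profile(velocity, 0.4, 2.0)) / 6.0;
}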

Ok, now we have the basic function, let's look at the effect.

Pointer speed mappings

Figure 3: Mapping raw unaccelerated dx to accelerated dx, in mm/s, assuming a constant physical device resolution of 400 dpi that sends events at 125Hz. dx range mapped is 0..127
The graph was produced by sending 30 events with the same constant speed, then dividing by the number of events to reduce any effects tracker feeding has at the initial couple of events.

The two lines show the actual output speed in mm/s and the gain in mm/s, i.e. (output speed - input speed). We can see the little nook where the threshold kicks in, and that after it the acceleration is linear. Look at Figure 1 again: the linear acceleration is caused by the acceleration factor maxing out quickly.

Most of this graph is theoretical only though. On your average mouse you don't usually get a delta greater than 10 or 15 and this graph covers the theoretical range to 127. So you'd only ever be seeing the effect of up to ~120 mm/s. So a more realistic view of the graph is:

Figure 4: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event).
Same data as Figure 3, but zoomed to the realistic range. We go from a linear speed increase (no acceleration) to a quick bump once the threshold is hit and from then on to a linear speed increase once the maximum acceleration is hit.

And to verify, the ratio of output speed : input speed:

Figure 5: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated.

Looks pretty much exactly like the pure acceleration function, which is to be expected. What's important here though is that this is the effective speed, not some mathematical abstraction. And it shows one limitation: we go from 0 to full acceleration within a really small window.

Again, this is the full theoretical range, the more realistic range is:

Figure 6: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated. Zoomed in to a max of 120 mm/s (15 dx/event).
Same data as Figure 5, just zoomed in to a maximum of 120 mm/s. If we assume that 15 dx/event is roughly the maximum you can reach with a mouse you'll see that we've reached maximum acceleration at a third of the maximum speed and the window where we have adaptive acceleration is tiny.

Tweaking threshold/accel doesn't do that much. Below are the two graphs representing the default (threshold=0.4, accel=2), a doubled threshold (threshold=0.8, accel=2) and a doubled acceleration (threshold=0.4, accel=4).

Figure 6: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.
Figure 7: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, see Figure 5 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent threshold:accel settings of 0.4:2, 0.8:2, 0.4:4.
Doubling either setting just moves the adaptive window around, it doesn't change that much in the grand scheme of things.

Now, of course these were all fairly simple examples with constant speed, etc. Let's look at a diagram of what is essentially random movement, me clicking around in Firefox for a bit:

Figure 8: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set.
And the zoomed-in version of this:
Figure 9: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set, zoomed in to events 450-550 of that set.
This is more-or-less random movement reflecting some real-world usage. What I find interesting is that it's very hard to see any areas where smoothing takes visible effect. The accelerated curve largely looks like a stretched input curve. To be honest, I'm not sure what I should've expected here and how to read that; pointer acceleration data in real-world usage is notoriously hard to visualize.

Summary

So in summary: I think there is room for improvement. We have no acceleration up to the threshold, then we accelerate within too small a window. Acceleration stops adjusting to the speed soon. This makes us lose precision and small speed changes are punished quickly.

Increasing the threshold or the acceleration factor doesn't do that much. Any increase in acceleration makes the mouse faster but the adaptive window stays small. Any increase in threshold makes the acceleration kick in later, but the adaptive window stays small.

We've already merged a number of fixes into libinput, but some more work is needed. I think that to get a good pointer acceleration we need to get a larger adaptive window [Citation needed]. We're currently working on that (and figuring out how to evaluate whatever changes we come up with).

A word on units

The biggest issue I was struggling with when trying to understand the code was that of units. The code didn't document the units used anywhere, but it turns out that everything was either in device units ("mickeys"), device units/ms or (in the case of the acceleration factors) unitless.

Device units are unfortunately a pretty useless base entity, only slightly more precise than using the length of a piece of string. A device unit depends on the device resolution and of course that differs between devices. An average USB mouse tends to have 400 dpi (15.75 units/mm) but it's common to have 800 dpi, 1000 dpi and gaming mice go up to 8200dpi. A touchpad can have resolutions of 1092 dpi (43 u/mm), 3277 dpi (129 u/mm), etc. and may even have different resolutions for x and y.

This explains why until commit e874d09b4 the touchpad felt slower than a "normal" mouse. We scaled to a magic constant of 10 units/mm, before hitting the pointer acceleration code. Now, as said above the mouse would likely have a resolution of 15.75 units/mm, making it roughly 50% faster. The acceleration would kick in earlier on the mouse, giving the touchpad and the mouse not only different speeds but a different feel altogether.

Unfortunately, there is not much we can do about mice feeling different depending on the resolution. To my knowledge there is no way to query the resolution on a device. But for absolute devices that need pointer acceleration (i.e. touchpads) we can normalize to a fake resolution of 400 dpi and base the acceleration code on that. This provides the same feel on the mouse and the touchpad, as much as that is possible anyway.
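A minimal sketch of that normalization, assuming we know the device's resolution in units/mm (the function name is made up; the real code lives in libinput's touchpad handling):

/* Scale touchpad deltas so the acceleration code sees them as if the
 * device had a fixed 400 dpi (about 15.75 units/mm) resolution. */
#define NORMALIZED_DPI 400.0

static void
normalize_delta(double res_x_units_per_mm, double res_y_units_per_mm,
                double *dx, double *dy)
{
    const double target_units_per_mm = NORMALIZED_DPI / 25.4;

    *dx *= target_units_per_mm / res_x_units_per_mm;
    *dy *= target_units_per_mm / res_y_units_per_mm;
}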

July 13, 2014
  • EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long lost the SVG, and it got kind of messed up, but it’s there at the bottom.
  • EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
  • EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
  • EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that supports page tables and address spaces. It’s called “true” because the Aliasing PPGTT was introduced first and often was simply called “PPGTT.”

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn’t realistically be enabled in isolation from the existing driver. When regressions occur it’s likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and that help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2. I highly recommend that the stability patches be used unless you’re reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how the hardware works. There won’t be PRM references, or hardware state machines. All of those mechanics have been described in part 1 and part 2.

A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two; yellow, orange and blue in the last) that denotes the scope of a GPU context/process with the specified feature. Incrementally, the definition of a process begins to bleed between the CPU and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

File Descriptors

Initially all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering will get a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate “who” was doing “what.” This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement, it’s just an idr in the kernel) there exists no mechanism to reference buffer handles from a different file descriptor. For applications which do not require their context to be saved, for non-buggy apps, and for non-malicious apps, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic since each has a different file descriptor1. Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation. Before hardware contexts.

Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided  by the original file descriptor mechanism. The hardware context was an opt-in interface which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there’s not really any point in lamenting about how it could be better, now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains, extremely simple.

struct drm_i915_gem_context_create {
	/* output: id of new context*/
	__u32 ctx_id;
	__u32 pad;
};

struct drm_i915_gem_context_destroy {
	__u32 ctx_id;
	__u32 pad;
};
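From userspace the opt-in is about as small as the payloads suggest. Here’s a hedged sketch using libdrm’s drmIoctl(); most GPU clients go through libdrm_intel or mesa rather than calling these IOCTLs directly.

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Sketch only: create a hardware context on an already-open render fd and
 * return its id, or 0 on failure (0 is the shared default context, so the
 * kernel never hands it out for a newly created context). */
static uint32_t
create_hw_context(int fd)
{
	struct drm_i915_gem_context_create create;

	memset(&create, 0, sizeof(create));
	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create))
		return 0;

	return create.ctx_id;
}

static void
destroy_hw_context(int fd, uint32_t ctx_id)
{
	struct drm_i915_gem_context_destroy destroy;

	memset(&destroy, 0, sizeof(destroy));
	destroy.ctx_id = ctx_id;
	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
}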

As you can see from the two IOCTL payloads above, I wasn’t lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn’t a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory conscious GPU clients to keep things tidy. Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, this takes one of those off the list.

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

The block diagram is quite similar to the above diagram with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

Hardware context isolation

Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolation

If I wrote about this picture, there would be no point in continuing with an organized blog post :-). So I’ll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients:

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

Since the GGTT isn’t really mentioned much in this post, I’d like to point out  that the GTT still exists as you can see in this diagram. It is required for several components that were listed in my previous blog post.

VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as the VMA4. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, continuous range within an address space. Conceptually there isn’t much difference from a GEM BO. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, Address space). In the likely case that I made no sense just there, a VMA is just another handle on a chunk of GPU memory used for rendering.

Sharing VMAs

You can’t (see the note at the bottom). There’s not a whole lot I can say without doing another post about DMA-Buf, and/or Flink. Perhaps someday I will, but for now I’ll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space and a BO. It remains possible to share a BO. An address space exists for an individual GPU client’s process. Therefore it makes no sense to share a VMA since the address space cannot be shared5. As a result, when the existing sharing interfaces are used, a shared BO simply ends up with multiple VMAs referencing it, one per address space. Trying to go back to the math jargon again:

  1. VMA: (BO, Address Space) // Some BO mapped by the address space.
  2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
  3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.
VMA : PPGTT :: BO : GGTT

M = {1,2,3,…} N = {1,2,3,…}

In case it’s still unclear, I’ll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMA_global. Jumping ahead, a GPU client cannot have a global mapping, therefore, to render to the frontbuffer it too has a VMA, VMA_pp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can’t think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.

Data Structures

Here are the relevant data structures cropped for the sake of brevity.

struct i915_address_space {
        struct drm_mm mm;
	unsigned long start;            /* Start offset always 0 for dri2 */
	size_t total;           /* size addr space maps (ex. 2GB for ggtt) */
	struct list_head active_list;
	struct list_head inactive_list;
};

struct i915_hw_ppgtt {
        struct i915_address_space base;
	int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
			 struct intel_engine_cs *ring,
			 bool synchronous);

};
struct i915_vma {
        struct drm_mm_node node;
        struct drm_i915_gem_object *obj;
        struct i915_address_space *vm;
};

The struct i915_hw_ppgtt is a subclass of a struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+ but I’ve not opted to do this. I feel there is too much duplication for not enough benefit.

I’ve already explained in different words that a range of used address space is the VMA. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node because this is the used part of the address space6. In the i915_vma struct above is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that defines the VMA.

HOLE  0x0 -> 0x64000
VMA 1 0x64000 -> 0x69000
HOLE  0x69000 -> 512M
VMA 2 512M -> 512.004M
HOLE  ~512M -> 2GB
Allocated space: 0x6000  Free space: 0x7fffa000

Relation to the Hardware Context

struct intel_context {
	struct kref ref;
	int id;
	...
	struct i915_address_space *vm;
};

With the 3 elements discussed a few times already: file descriptor, context, PPGTT, we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this was not done, then innocent GPU clients could feel the wrath. Since the file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), it made sense to hook off of that to create the contexts and PPGTTs.

Implicit Context (“private default context”)

From here on out we can consider a “context” as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today, if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I’ve called this the Private Default Context as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context. Hardware state maintains its state from the previous render operation when using the IOCTLs.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I’ve submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, the above section will be false and we can just say – every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context savvy).

struct drm_i915_gem_execbuffer2 {
	/**
	 * List of gem_exec_object2 structs
	 */
	__u64 buffers_ptr;
	__u32 buffer_count;

	/** Offset in the batchbuffer to start execution from. */
	__u32 batch_start_offset;
	/** Bytes used in batchbuffer from batch_start_offset */
	__u32 batch_len;
	...
	__u64 flags;
	__u64 rsvd1; /* now used for context info */
	__u64 rsvd2;
};

A process may wish to create several GL contexts. The API allows this, and for reasons I don’t understand, it’s something some applications wish to do. If there were no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context, or else it would not reap the benefits of everything we’ve discussed for a GL context.
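For illustration, here is roughly what using a second context looks like from userspace. This is only a hedged sketch: fd is assumed to be an already-open DRM file descriptor, batch_handle an already-prepared batchbuffer BO, error handling is omitted, and a real GPU client would normally go through libdrm rather than raw ioctls.

#include <stdint.h>
#include <sys/ioctl.h>
#include <i915_drm.h>	/* from libdrm */

/* Sketch only: create a second HW context (and, with True PPGTT, the PPGTT
 * that comes with it) and execute an already-prepared batchbuffer in it. */
static void exec_in_new_context(int fd, uint32_t batch_handle, uint32_t batch_len)
{
	struct drm_i915_gem_context_create create = { 0 };
	struct drm_i915_gem_exec_object2 obj = { 0 };
	struct drm_i915_gem_execbuffer2 execbuf = { 0 };

	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);

	obj.handle = batch_handle;

	execbuf.buffers_ptr = (uintptr_t)&obj;
	execbuf.buffer_count = 1;
	execbuf.batch_start_offset = 0;
	execbuf.batch_len = batch_len;
	execbuf.rsvd1 = create.ctx_id;	/* rsvd1 carries the context ID */

	ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}

Submitting with a different ctx_id in rsvd1 is all it takes to pick which context (and therefore which PPGTT) the commands run in.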

The Big Picture – literally

Overview

Context:PPGTT

One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context.

Quoting myself from one of my earlier public declarations, here:

My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed.

My idea was to embed the PPGTT within the context structure, so that creating a context always resulted in a new PPGTT; creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I’m still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that this sharing is a requirement for them. Fundamentally it allows the client to create multiple GPU contexts that share an address space (which resembles what you’d get back when there were only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally however, my proposal matches what is in use currently. I think it’s a bit easier to think of things this way too.

Current Mesa
Current DDX
2 hypothetical scenarios

Conclusion

Please feel free to send me issues or questions.
Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State Machine

TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward, test it, report bugs, try to fix stuff, don’t yell at me when things break :-).

Summary

That’s most of it. I’d like to give the 10 second summary.

  1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
  2. The GPU has a virtual address space per DRI file descriptor.
  3. There is a connection between the PPGTT, and a Hardware Context.
  4. VMAs are backed by BOs which are backed by physical pages.
  5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing                   Modern X86 CPU    Modern i915 GPU
Phys Address Limit      48b?              ~40b
Process Isolation       Yes               Yes (with True PPGTT)
Virtual Address Space   Yes               Yes
64b VA Space            Yes               GEN8+ 48b only
PTE access controls     Yes               No
Page Fault Handling     Yes               No
Preemption7             Yes               *With execlists

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU – it still has a ways to go. I would be surprised if things didn’t continue going in this direction.

SVG Links

As usual, please feel free to do something useful with the images I’ve created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg

  1. It’s technically possible to make them be the same BO through the two buffer sharing mechanisms. 

  2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting, however it does not contribute to any part of the GPU “process” 

  3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted-in. This means if one GPU client does opt-in, and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow. 

  4. I would have preferred that the reservation of a space within the address space be called a “GVMA”, but that was shot down during review.

  5. There’s a whole section below describing how this statement could be false. For now, let’s pretend address spaces can’t be shared 

  6. For those unfamiliar with the Direct Rendering Manager memory manager, a drm_mm is the structure for the memory manager provided by the DRM midlayer. It does all the things you’d expect out of a memory manager, like finding free nodes, allocating nodes, and freeing up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees. 

  7. I am defining the word preemption as the ability to switch at an arbitrary point in time between contexts. On the CPU this is easily accomplished. The GPU running the i915 driver as of today has no way to do this. Once a batch is running it cannot be interrupted except for RC6. 

July 12, 2014

EDIT1 (2014-07-12): Apologies to planets for update.

  • Change b->B (bits to bytes) in the state walkthrough (thanks to Bernard Kilarski)
  • Convert SVG images to PNG because they weren’t being rendered properly.
  • Added TOC
  • Use new style footnotes
  • NOTE: With command parser merged, and execlists on the way – this post is already somewhat outdated.

Disclaimer: Everything documented below is included in the Intel public documentation. Anything I say which offends you is my own words and not those of Intel. Sadly, anything I say that is of monetary value belongs to Intel.

Intro

Goal

My goal is to lay down a basic understanding of how GEN GPU execution works using gem_exec_nop from the intel-gpu-tools suite as an example. One who puts in the time to read this should understand how command submission works for the i915 driver, and how gem_exec_nop tests command submission. You should also have a decent idea of how the hardware handles execution. I intentionally skip topics like relocations, and how graphics virtual addresses are maintained. They are not directly related towards execution, and would make the blog entry too long.

Ideally, I am hoping this will enable people who are interested to file better bugs, improve our tests, or write their own tests.

Terminology

  • i915: The name of the Linux kernel driver for Intel GEN graphics. i915 is the name of an ancient chipset that was one of the first supported by the driver. The driver itself supports chipsets both before, and after i915.
  • BO: Buffer Object. GEM uses handles to identify the buffers used as graphics operands in order to avoid costly copies from userspace to kernel space. BO is the thing which is encapsulated by that handle.
  • GEM: Graphics Execution Manager. The name of a design and API to give userspace GPU clients the ability to execute work on a GPU (the API is technically not specific to GEN).
  • GEN: The name of the Graphics IP developed by Intel Corporation.
  • GPU client: A userspace application or library that submits GPU work.
  • Graphics [virtual] Address: Address space used by the GPU for mapping system memory. GEN is a UMA architecture with regard to the CPU.
  • NOP/NOOP: An assembly instruction mnemonic for a machine opcode that does no work. Note that this is not the same as a lack of work. The instruction is indeed executed; it simply has no side-effects. The execution latency is strictly greater than zero.
  • relocations: The way in which GEM manages to make GPU clients agnostic to where the buffers are actually mapped by the GPU. Out of scope for this blog entry.

Source Code

The source code in this post is found primarily in two places. Note that the links below are both from very fast moving code bases.

The test case: http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tests/gem_exec_nop.c

The driver internals: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_execbuffer.c

GEN Hardware

Before going over gem_exec_nop, I’d like to give an overview of modern GEN hardware:

Coarse GEN block diagram.

I don’t want to say this is the exhaustive list, and indeed, each block above has many sub-components. In the part of the driver I work on, this is a pretty logical way to split it. Each of the blocks share very little. The common denominator is a Graphics Virtual Address which is understood by all blocks. This provides easy communication for work needing to be sent between components. As an example, the command streamer might want the display engine to flip to a new surface. It does so by sending a special message to the display engine along with the address of the surface to flip to. The display engine may respond “out of band” via interrupts (flip completion). There are also built in synchronization primitives that allow the command streamer to wait on events sent by the display engine (we’ll get to the command streamer in more detail later).

Excluding audio, since I know nothing about audio… by a very rough estimate, 85% of the Linux i915.ko code falls into “Other.” Of the remaining 15% in graphics processing engine, the kernel driver tends to utilize very little of the Fixed Func/EU block above. Total lines of code outside of the kernel driver for the EU block is enormous, given that the X 2d driver (DDX), mesa, libva, and beignet all have tons of lines of code just for utilizing that part of the hardware.

gem_exec_nop

gem_exec_nop is one of my favorite tests. For me, it’s the first test I run to determine whether or not to even bother with the rest of the test suite.

  • It’s dead simple.
  • It’s fast.
  • It tests a surprisingly large amount of the hardware, and software.
  • Gives some indication of performance
  • It’s deader than dead simple

It’s not a perfect test, some of the things which are missing:

  • Handling under memory pressure (relocs, swaps, etc.)
  • Tiling formats
  • Explicit testing of cacheability types, and coherency (LLC et al.)
  • several GEM interfaces
  • The aforementioned 85% of the driver
  • It doesn’t even execute a NOP instruction!!!

gem_exec_nop flowchart

NOTE: I will explain more about what a batchbuffer is later.

execbuf_5_steps

* (step 1) The docs say we must always follow MI_BATCH_BUFFER_END with an MI_NOOP. The presumed reason for this is that the hardware may prefetch the next instruction, and I think the designers wanted to dumb down the fact that they can't handle a pagefault on the prefetch, so they simply demand a MI_NOOP.
** (step 1) MI_NOOP is defined as a dword of value 0x00000000. GEM BOs are zero-filled by default, so we have an implicit buffer full of MI_NOOPs.
  1. Creating a batchbuffer is done using GEM APIs. Here we create a batchbuffer of size 4096, and fill in two instructions. The batchbuffer is the basic unit of execution. The only pertinent point to keep in mind is this is the only buffer being created for this test. Note that this step, or a similar one, is done in almost every test.
  2. Here we set up the data structure that will be passed to the kernel in an IOCTL. There’s a pointer to the list of buffers; in our case, just the one batchbuffer created in step 1. The batch size of 8 (each of the two instructions is 4 bytes), and some flags which we’ll skip for now, are also included in the struct. (A rough sketch of these first steps follows this list.)
  3. The dotted line through step 3 denotes the userspace/kernel barrier. Above the line is gem_exec_nop.c, below is i915_gem_execbuffer.c. DRM, which is a common subsystem interface, actually dispatches the IOCTLs to the i915 driver.
  4. The kernel handles the data it received. Talked about in more detail later.
  5. Submit to the GPU for execution. Also, detailed later.
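To make steps 1 through 3 concrete, below is a hedged approximation using raw ioctls. The real gem_exec_nop uses intel-gpu-tools helper functions instead; fd is assumed to be an already-open DRM file descriptor, error handling is omitted, and the MI opcode encodings are the ones from the public docs.

#include <stdint.h>
#include <sys/ioctl.h>
#include <i915_drm.h>	/* from libdrm */

#define LOCAL_MI_BATCH_BUFFER_END	(0xA << 23)	/* 0x05000000 */
#define LOCAL_MI_NOOP			0x00000000

static void submit_nop_batch(int fd)
{
	uint32_t batch[2] = { LOCAL_MI_BATCH_BUFFER_END, LOCAL_MI_NOOP };
	struct drm_i915_gem_create create = { .size = 4096 };
	struct drm_i915_gem_pwrite pwrite = { 0 };
	struct drm_i915_gem_exec_object2 obj = { 0 };
	struct drm_i915_gem_execbuffer2 execbuf = { 0 };

	/* Step 1: create the 4096 byte batchbuffer BO and write the two
	 * instructions into it. */
	ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);
	pwrite.handle = create.handle;
	pwrite.offset = 0;
	pwrite.size = sizeof(batch);
	pwrite.data_ptr = (uintptr_t)batch;
	ioctl(fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite);

	/* Step 2: one buffer in the list, 8 bytes of batch, no flags. */
	obj.handle = create.handle;
	execbuf.buffers_ptr = (uintptr_t)&obj;
	execbuf.buffer_count = 1;
	execbuf.batch_start_offset = 0;
	execbuf.batch_len = 8;

	/* Step 3: cross the userspace/kernel barrier. */
	ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}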

Execbuf2 IOCTL and Hardware Execution

i915.ko execbuffer2 handling (steps 4 and 5 in the picture above)

The eventual goal of the kernel driver is to take the batchbuffer passed in from userspace, make sure it is visible to the GPU by mapping it, and then submit it to the GPU for execution. The aforementioned operations are synchronous with respect to the IOCTL1. In other words, by the time the execution returns to the application, the GPU knows about the work. The work is completed asynchronously.

I’ll detail some of the steps a bit. Unfortunately, I do not have pretty pictures for this one. You can follow along in i915_gem_execbuffer.c; i915_gem_do_execbuffer()

  1. copy_from_user – Copy the BOs in from userspace. Remember that the BO is a handle and not actual memory being copied; this allows a relatively small and fast copy to take place. In gem_exec_nop, there is exactly 1 BO: the batchbuffer.
  2. some sanity checks – not interesting
  3. look up – Do a lookup of all the handles for the BOs passed in via the buffers_ptr member (copied in during #1). Make sure the buffers still exist and so on. In our case this is only one buffer and it’s unlikely that it would be destroyed before execbuffer completes2
  4. Space reservation – Make sure there is enough address space in the GPU for the objects. This also includes checking for various alignment restrictions, and a few other details not really relevant to this specific topic. For our example, we’ll have to make sure we have enough space for 1 buffer of size 4096, and no special alignment requirements. It’s the second simplest request possible (first would be to have no buffers).
  5. Relocations – save for another day.
  6. Ring synchronization – Also not pertinent to gem_exec_nop. Since it involves the command streamer, I’ll include a brief description as a footnote3
  7. Dispatch – Finally we can tell the GEN hardware about the work that we just got. This means using some architectural registers to point the hardware at the batchbuffer which was submitted by userspace. More on this shortly…
  8. Some more relocation stuff – save for another day

Execution part I (Command Streamer/Ringbuffer)

Fundamentally, all work is submitted via a hardware ringbuffer, and fetched via the command streamer. A command streamer is many things, but for now, saying it’s a DMA engine for copying in commands and associated data is a good enough definition. The ringbuffer is a canonical ringbuffer with a HEAD and TAIL pointer (to be clear: TAIL is the one incremented by the CPU, and read by the GPU. HEAD is written by the GPU and read by the CPU). There is a third pointer known as ACTHD (or Active HEAD) – more on this later. At driver initialization, the space for the ringbuffer is allocated, and the address and size are written to hardware registers. When the driver wants to submit work, it writes data at the current TAIL pointer, and increments the TAIL pointer. Once the TAIL is incremented, the hardware will start reading in commands (via DMA), and increment the HEAD (and ACTHD) pointer as commands are retired.
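Here is a heavily simplified sketch of the software side of that flow. None of this is the real i915 code (which lives in intel_ringbuffer.c); write_ring_tail_register() is a stand-in for the MMIO write of the TAIL register, and the MI_BATCH_BUFFER_START flags for selecting the address space are omitted.

#include <stdint.h>

/* Stand-in for the MMIO write that tells the hardware about the new TAIL. */
static void write_ring_tail_register(uint32_t tail) { (void)tail; }

struct sketch_ring {
	uint32_t *vaddr;	/* CPU mapping of the ring contents */
	uint32_t size;		/* ring size in bytes, power of two */
	uint32_t tail;		/* byte offset; written by CPU, read by GPU */
};

/* Write one dword at TAIL and advance TAIL (with wrap-around). */
static void ring_emit(struct sketch_ring *ring, uint32_t dword)
{
	ring->vaddr[ring->tail / 4] = dword;
	ring->tail = (ring->tail + 4) & (ring->size - 1);
}

/* Equivalent of steps 4-6 in the walkthrough further below: point the
 * hardware at a batchbuffer, then kick off the DMA fetch by updating TAIL. */
static void ring_submit_batch(struct sketch_ring *ring, uint32_t batch_addr)
{
	ring_emit(ring, 0x31 << 23);	/* MI_BATCH_BUFFER_START, opcode 0x31 */
	ring_emit(ring, batch_addr);	/* e.g. 0x22000 */
	write_ring_tail_register(ring->tail);
}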

Early GEN hardware had only 1 command streamer. It was referred to as, “CS.” When Ironlake introduced the VCS, or video engine command streamer, they renamed (in some places) the original CS to RCS, for render engine command streamer. Sandybridge introduced the blit engine command streamer BCS, and Haswell the video enhancement command streamer, or VECS. Each command streamer supports its own instruction set, though many instructions are the same on multiple command streamers, MI_NOOP is supported on all of them :P Having multiple command streamers not only provides an easy way to add new instructions, but it also allows an asynchronous way to submit work, which can be very useful if you are trying to do two orthogonal tasks. As an example, take an OpenCL application running in conjunction with your favorite 3d benchmark. The 3d benchmark internally will only use the 3d and blit hardware, while the OCL application will use the GPGPU hardware. It doesn’t make sense to have either one wait for a single command streamer to fetch the data (especially since I glossed over some other details which make it an even worse idea) if there won’t be any [or few] data dependencies.

The kernel driver is the only entity which can insert commands into the ringbuffer. The ringbuffer is therefore considered trusted, and all commands supported by the hardware may be run here (the docs use the word, “secure” but this gets confusing quickly). The way in which the batchbuffer we created in gem_exec_nop gets executed will be explained a bit further shortly, but the contents of that batchbuffer are not directly inserted into the ringbuffer4. Take a quick peek at the text in the image below for how it works.

Here is a pretty basic picture describing the above. The HEAD and TAIL point to the next instruction to be executed, therefore this would be midway through step #5 in the flowchart above.

ringbuffer

Execution part II (MI_BATCH_BUFFER_START, batchbuffer)

A batchbuffer is the way in which we can submit work to the GPU without having to write into the hardware ringbuffer (since only the kernel driver can do that). A batchbuffer is submitted to the GPU for execution via a command called MI_BATCH_BUFFER_START, which is inserted into the ringbuffer and read by the command streamer. Batchbuffers share an instruction set with the command streamer that dispatched them (ie. batches run by the blit engine can issue blit commands), and the execution flow is very similar to that of the command streamer as described in the first diagram and subsequently. On the other hand, there are quite a few differences. Batchbuffer execution is not guided by HEAD and TAIL pointers. The hardware will continue to execute every instruction in a batchbuffer until it hits another MI_BATCH_BUFFER_START command, or an MI_BATCH_BUFFER_END. Yes, you can get into an infinite loop of batchbuffers with this nesting of MI_BATCH_BUFFER_START commands. The hardware has an internal HEAD pointer which is exposed for debug purposes, called ACTHD. This pointer works exactly like a HEAD pointer would, except it is never compared against TAIL to determine the end of execution5. MI_BATCH_BUFFER_END will directly guide execution back to the hardware ringbuffer. In other words you need only one MI_BATCH_BUFFER_END to break the chain of n MI_BATCH_BUFFER_STARTs.

Getting back to gem_exec_nop specifically for a sec: this is what we set up in step #1. Recall it had 2 instructions: MI_BATCH_BUFFER_END, then MI_NOOP.

batch

Here is our graphical representation of the batchbuffer from gem_exec_nop. Notice that the batchbuffer doesn’t have a tail pointer, only ACTHD.

Hardware states

The following macro-level state machine/flowchart hybrid can be used to describe both ringbuffer execution and batchbuffer execution, though the descriptions differ slightly. By “macro-level” I mean each state may not match exactly to a state within the hardware’s state machines. It’s more of a state in the data flow. The “state machines” for both ringbuffers and batchbuffers are pretty similar. What follows is a diagram that mostly works for both, and a description of each state.

cs_state_machine

I’ll use “RSn” for ringbuffer state n, and “BSn” for batchbuffer state n.

  • RS0: Idle state, HEAD == TAIL. Waiting for driver to increment tail.
  • RS1: TAIL has changed. Fetch some amount between HEAD and TAIL (I’d guess it fetches the whole thing since the ringbuffer size is strictly limited).
  • RS2: Fetch has completed, and command parsing can begin. Command parsing here is relatively easy. Every command is 4B aligned, and has the total command length embedded in the first 4th (1 based) byte of the opcode. Once it has determined the length, it can send that many dwords to the next stage.
  • RS3: 1 command has been parsed and sent to be executed (pun intended).
  • RS4: The execute phase required some more work: if the command executed in RS3 requires some extra data, now is when it will get fetched – and AFAICT, the hardware will stall waiting for the fetch to complete. If there is nothing left to do for the command, HEAD is incremented. Most commands will be done and increment HEAD. MI_BATCH_BUFFER_START is a common exception.  I wish I could easily change the image… this is really RS3.5.
  • RS5: An error state requiring a GPU reset.
  • BS0: ASSERT(last command != MI_BATCH_BUFFER_END) This isn’t a real state. While executing a batchbuffer, you’re never idle. We can use this state as a place to update ACTHD though, so let’s say ACTHD := batchbuffer start address.
  • BS1: Similar to RS1, fetch the data. Hopefully most of it exists in some internal cache since we had to fetch some amount of it in RS4, but I don’t claim to know the micro-architecture details on this.
  • BS2: Just like RS2
  • BS3: Just like RS3
  • BS4: Just like RS4

gem_exec_nop state walkthrough

With the above knowledge, we can now step through the actual stuff from gem_exec_nop. This combines pretty much all the diagrams above (ie. you might want to reference them), I tried to keep everything factually correct along the way minus the address I make up below. Assume HEAD = 0x30, TAIL = 0x30, ACTHD = 0x30

  1. Hardware is in Rs0.
  2. gem_exec_nop runs; submits previously discussed setup to i915.
  3. *** kernel picks address 0x22000 for the batchbuffer (remember I said we’re ignoring how graphics addresses work for now, so just play along)
  4. i915.ko writes 4 bytes, MI_BATCH_BUFFER_START to hardware ringbuffer.
  5. i915.ko writes 4 bytes, 0x22000 to hardware ringbuffer.
  6. i915.ko increments the tail pointer by command length (8). TAIL := 0x38
  7. RS0->RS1:  DMA fetches TAIL-HEAD bytes. (0x38-0x30) = 8B
  8. RS1->RS2: DMA completes. Parsing will find that the command is MI_BATCH_BUFFER_START, and it needs 1 extra dword to proceed. This 8B command is then ready to move on.
  9. RS2->RS3: Command was successfully parsed. There is a batchbuffer to be fetched, and once that completes we need to execute it.
  10. RS3->RS4: Execution was okay, DMA fetch of the batchbuffer at 0x22000 starts…completes
  11. RS4->BS0: ACTHD := 0x22000
  12. BS0->BS1: We’re in a batchbuffer. The commands we need to fetch are in our local cache, fetched by the ringbuffer just before so no need to do anything more.
  13. BS1->BS2: Parsing of the batchbuffer begins. The first command pointed to by ACTHD is MI_BATCH_BUFFER_END. It is only 4B.
  14. BS2->BS3: Parse was successful. Execute the command MI_BATCH_BUFFER_END. ACTHD += 4. There are no extra requirements for this command.
  15. BS3->RS0: Batchbuffer told us to end, so we go back to the ring. Increment our HEAD pointer by the size of the last command (8B). Set ACTHD equal to HEAD. HEAD := 0x38. ACTHD := 0x38.
  16. HEAD == TAIL… we’re idle.

Summary

User space builds up a command, and list of buffers. Then the userspace tells the kernel about it via IOCTL. Kernel does some work on the command to find all the buffers and so on, then submits it to the hardware. Some time later, userspace can see the results of the commands (not discussed in detail). On the hardware side, we’ve got a ringbuffer with a head and tail pointer, a way to dispatch commands which are located sparsely in our address space, and a way to get execution back to the ringbuffer.

SVG links

https://bwidawsk.net/blog/wp-content/uploads/2014/07/gen_block_diagram.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/execbuf_5_steps.svg

  1. The synchronous nature of the IOCTL is something which has been discussed several times. One usage model which would really like to break that is a GPU scheduler. In the case of a scheduler, we’d want to queue up work and return to userspace as soon as possible; but that work may not yet make it into the hardware. 

  2. Buffer objects are managed with a reference count. When a buffer is created, it gets a ref count of 1, and the refcount is decremented either when the object is explicitly destroyed, or the application ceases to exist. Therefore, the only way gem_exec_nop can fail during the look up portion of execbuffer, is if the application somehow dies after creating the batchbuffer, but before calling the execbuffer IOCTL. 

  3. As I showed in the first diagram, we consider command execution to be “in order.” Here this means that commands are executed sequentially, and (hand waving over some caching stuff) the side effects of earlier commands are complete by the time later commands execute. This made the implicit synchronization that is baked in to the GEM API really easy to handle (the API has no way to explicitly add synchronization objects). To put this another way, if a GPU client submits a command that operates on object X, then a second command also operating on object X, they were guaranteed to execute in that order (as long as there was no race condition in userspace submitting commands). However, when you have multiple instances of the in-order command streamers, synchronization is no longer free. If a command is submitted to command streamer 1 referencing object X, and then a second command is submitted to command streamer 2 also referencing object X… no guarantees are made by hardware about the order of the commands. In this case, synchronization can be achieved in two ways: hardware based semaphores, or stalling on the second command until the first one completes.
     

  4. Certain commands which may provide security risks are not allowed to be executed by untrusted entities. If the hardware parses such a command from an untrusted entity, it will convert it into an MI_NOOP. Batchbuffers can be executed in a trusted manner, but implementing such a thing is complex.
     

  5. When the CS is executing from the ring, HEAD == ACTHD. Once the CS jumps into the batchbuffer, ACTHD will take on the address within the batchbuffer, while HEAD will remain relevant only to its position in the ring. We use this fact to help us debug whether we hung in the batch, or in the ring. 

July 10, 2014

One feature we are spending quite a bit of effort in around the Workstation is container technologies for the desktop. This has been on the wishlist for quite some time and luckily the pieces for it are now coming together. Thanks to strong collaboration between Red Hat and Docker we have a great baseline to start from. One of the core members of the desktop engineering team, Alex Larsson, has been leading the Docker integration effort inside Red Hat and we are now preparing to build onwards on that work, using the desktop container roadmap created by Lennart Poettering.

So while Lennart's LinuxApps ideas predate Docker, they do provide a great set of steps we need to turn Docker into a container solution not just for server and web applications, but also for desktop applications. And luckily a lot of the features we need for the desktop are also useful for the other use cases; for instance, one of the main things Red Hat has been working on with our friends at Docker is integrating systemd with Docker.

There is a set of other components as part of this plan too. One of the big ones is Wayland, and I assume that if you are reading this you have already seen my Wayland in Fedora updates.

Two other core technologies we identified are kdbus and overlayfs. Alex Larsson has already written an overlayfs backend for Docker, and Fedora Workstation Steering committee member, Josh Bowyer, just announced the availability of a Copr which includes experimental kernels for Fedora with overlayfs and kdbus enabled.

In parallel with this, David King has been prototyping a version of Cheese that can be run inside a container and that uses this concept that in the LinuxApps proposal is called ‘Portals’, which is basically dbus APIs for accessing resources outside the container, like the webcam and microphone in the case of Cheese. For those interested he will be presenting on his work at GUADEC at the end of the Month, on Monday the 28th of July. The talk is called ‘Cheese: TNG (less libcheese, more D-Bus)’

So all in all the pieces are really starting to come together now, and we expect to have some sessions during both GUADEC and Flock this year to try to hammer out the remaining details. If you are interested in learning more or joining the effort, be sure to check the two conferences' notice boards for the time and place of the container sessions.

There is still a lot of work to do, but I am confident we have the right team assembled to do it. In addition to the people already mentioned, we for instance have Allan Day, who is kicking off an effort to look at the user experience we want to provide around the container hosted application bundles, for things like upgrades and installation. And we will also work with the wider Docker community to make sure we have great composition tools for creating these container images available for developers on Fedora.

July 04, 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future, what we achieved and where we are going. In my talk I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system to being more a set of basic building blocks to build an OS from. Most of the talk was about where we still intend to take systemd, which areas we believe should be covered by systemd, and of course also the always difficult question of where to draw the line and what clearly is outside the focus of systemd. The slides of my talk can be found online. (No video recording I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual, the best discussions I had in the hallway track, discussing Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing, we tried all kinds of fantastic stuff, from Peking duck to bullfrog Sichuan style. Yummy. And one of these days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!

Update: I had actually managed to disable the VAAPI encoding in 1.2, so I just rolled a 1.3 release which re-enabled it. Apart from that it is identical.

So I finally managed to put out a new Transmageddon release today. It is primarily a bugfix release, but considering how many critical bugs I ended up fixing for this release I am actually a bit embarrassed about my earlier 1.x releases. There was for instance some stupidity in my code that triggered thread safety issues, which I know hit some of my users quite badly. But there were other things not working properly either, like dropping the video stream from a file. Anyway, I know some people think that filing bugs doesn’t help, but I think I fixed every reported Transmageddon bug with this release (although not every feature request bugzilla item). So if you have issues with Transmageddon 1.2 please let me know and I will try my best to fix them. I do try to keep a policy that it is better to have limited functionality, but what is there is solid, as opposed to having a lot of features that are unreliable or outright broken.

That said I couldn’t help myself so there are a few new features in this release. First of all if you have the GStreamer VAAPI plugins installed (and be sure to have the driver too) then the VAAPI GPU encoder will be used for h264 and MPEG2.

Secondly I brought back the so called ‘xvid’ codec (even though xvid isn’t really a separate codec, but a name used to refer to the MPEG4 Video codec using the advanced-simple profile).

So as the screenshot below shows, there are not a lot of UI changes since the last version, just some smaller layout and string fixes, but stability is hopefully greatly improved.
transmageddon-1.2

I am currently looking at a range of things as the next feature for Transmageddon including:

  • Batch transcoding, allowing you to create a series of transcoding jobs upfront instead of doing the transcodes one by one
  • Advanced settings panel, allowing you to choose which encoders to use for a given format, what profiles to use, turn deinterlacing on/off and so on
  • Profile generator, create new device profiles by inspecting existing files
  • Redo the UI to switch away from deprecated widgets

If you have any preference for which I should tackle first, feel free to let me know in the comments, and I will try to let the popular will decide what I do first :)

P.S. I would love to have a high contrast icon for Transmageddon (HighContrast App icon guidelines) – so if there are any graphics artists out there willing to create one for me, I would be duly grateful.

July 03, 2014

As we are approaching Fedora Workstation 21 we recently held a meeting inside Red Hat to review our Wayland efforts for Fedora Workstation. Switching to a new core technology like Wayland is a major undertaking and there are always big and small surprises that come along the way. So the summary is that while we expect to have a version of Wayland in Fedora Workstation 21 that will be able to run a fully functional desktop, there are some missing pieces we now know will not make it. Since we want to ship at least one Fedora release with a feature complete Wayland as an option before making it the default, that means that Fedora Workstation 23 is the earliest Wayland can be the default.

Anyway, here is what you can expect from Wayland in Fedora 21.

  • Wayland session available in GDM (already complete and fully working)
  • XWayland working, but without accelerated 3D (done, adding accelerated 3D will be done before FW 22)
  • Wayland session working with all free drivers (Currently only Intel working, but we expect to have NVidia and AMD support enabled before F21)
  • IBUS input working. (Using the IBUS X client. Wayland native IBUS should be ready for FW22.)
  • Touchpad acceleration working. (Last missing piece for a truly usable Wayland session, lots of work around libinput and friends currently to have it ready for F21).
  • Wacom tablets will not be ready for F21
  • 3D games should work using the Wayland backend for SDL2 (SDL1 games will need to wait for FW22 so they can use the accelerated XWayland support).
  • Binary driver support from NVidia and AMD very unlikely to be ready for F21.
  • Touch screen support working under Wayland.

We hope to have F21 testbuilds available soon that the wider community can use to help us test, because even when all the big ‘checkboxes’ are filled in there will of course be a host of smaller issues and outright bugs that needs ironing out before Wayland is ready to replace X completely. We really hope the community will get involved with testing Wayland so that we can iron out all major bugs before F21.

How to get involved with the Fedora Workstation effort

To help more people get involved we recently put up a tasklist for the Fedora Workstation. It is a work in progress, but we hope that it will help more people get involved and help move the project forward.

Update: Peter Hutterer posted this blog entry explaining pointer acceleration and what we are looking at to improve it.

June 26, 2014

Hi folks,

Following up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and to read back results from mesa. At the end of this project, almost all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed, and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

ioctl calls vs software methods

An ioctl (Input/Output control) is the most common hardware-controlling operation, a sort of system call available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card is processing the command stream (FIFO) and encounters an unimplemented method. Then PFIFO waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this time, the kernel is inside an interrupt context; the driver will then determine the method and object that caused the interrupt and implement the method. The main difference between these two approaches is that software methods can be easily synchronized with the CPU through the command stream and are context-dependent, while ioctls are unsynchronized with the command stream. With SW methods, we can make sure the method is called right after the commands we want, and the following commands won’t get executed until the SW method is handled by the CPU; this is not possible with an ioctl.

Currently, I have a first prototype of that interface using a set of software methods, to get the advantage of synchronization along the command stream, but also because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context, we have to limit as much as possible the number of instructions needed to complete the task it processes, and it’s absolutely forbidden to do a sleep call, for example.

A first prototype using software methods

Basically that interface, like NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace, and synchronize the different queries which will be sent by the userspace to the kernel. All of these operations are sent through a set of software methods.

Configure a counter

To configure a counter we will use a software method which is not currently defined, but since we can send 32 bits of data along with it, that’s sufficient to identify a counter. For this, we can send the global ID of the counter, or allocate an object which represents a counter from the userspace and send its handle with that SW method. Then, the kernel pushes that counter into a staging area, waiting for the next batch of counters or for the sample command. This command can be invoked successively to add several counters. Once all counters added by the user are known by the kernel, it’s time to send the sample command. It’s also possible to synchronize the configuration with the beginning and the end of a frame using software methods.

Sample a counter

This command also uses a software method which just tells the kernel to start monitoring. At this time, the kernel configures the counters (i.e. writes values to a set of special registers), then reads and stores their values, including the number of cycles processed, which may be used by the userspace to compute a ratio.
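Putting the configure and sample steps together, here is what the userspace side might emit. This is purely hypothetical: as noted above, the software methods are not defined yet, so every name and method number below is made up for illustration.

#include <stdint.h>

struct push_ctx;	/* hypothetical handle for the channel's command stream */

/* Hypothetical: writes one software method and its 32-bit argument into the
 * command stream of the channel represented by 'push'. */
void emit_sw_method(struct push_ctx *push, uint32_t mthd, uint32_t data);

#define HYP_SW_MTHD_ADD_COUNTER	0x0600	/* made-up method numbers */
#define HYP_SW_MTHD_SAMPLE	0x0604

static void hyp_monitor_counters(struct push_ctx *push,
				 const uint32_t *counter_ids, int n)
{
	int i;

	/* Configure: push each counter ID; the kernel stages them. */
	for (i = 0; i < n; i++)
		emit_sw_method(push, HYP_SW_MTHD_ADD_COUNTER, counter_ids[i]);

	/* Sample: tell the kernel to program the signals and start counting. */
	emit_sw_method(push, HYP_SW_MTHD_SAMPLE, 0);
}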

Expose counter’s data to the userspace

Currently, we can configure and sample a counter, but the result of this counting period is not yet exposed to the userspace. Basically, to be able to send results from the kernel to mesa we use a notifier buffer object which is dedicated to the communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along a channel, so it can be accessible both by the kernel and the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel, then we just have to map it. At this time, the kernel can write results to this BO, and the userspace can read back from it. The result of a counting period is copied by the kernel to this notifier BO via another software method, which is also used to synchronize queries.

Synchronize queries with a sequence number

To synchronize queries we use a different sequence ID (like a fence) for each query we send to the kernel space. When the user wants to read out a result, it sends a query ID through a software method. Then this method does the read out, copies the counter’s value to the notifier BO and the sequence number at offset 0. Also, we use a ringbuffer in the notifier BO to store the list of counter ID, cycles and the counter’s value. This ringbuffer is a nice way to avoid stalling the command submission and is a good fit for the gallium HUD, which queues up to 8 frames before having to read back the counters. As for the HUD, this ringbuffer stores the result of the N previous readouts. Since offset 0 stores the latest sequence ID, we can easily check if the result is available in the ringbuffer. To check the result, we can busy-wait until the query we want is available in the ringbuffer, or we can check whether the result of that query has been overwritten by a newer one.
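To illustrate the readback scheme, here is a hedged sketch of the userspace check described above. The structure layout is made up for the example; the real layout is whatever the kernel-side prototype ends up writing.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for the notifier BO: the latest sequence ID at
 * offset 0, followed by a small ring of results. */
struct hyp_result {
	uint32_t query_id;	/* sequence number of this readout */
	uint32_t counter_id;	/* which signal was monitored */
	uint32_t cycles;	/* cycles elapsed, to compute a ratio */
	uint32_t value;		/* raw counter value */
};

struct hyp_notifier {
	uint32_t latest_query_id;	/* offset 0: last completed sequence */
	struct hyp_result ring[8];	/* results of the N previous readouts */
};

/* Returns true if the result for 'query_id' has completed and is still
 * present in the ring (i.e. not yet overwritten by a newer readout). */
static bool hyp_read_query(const struct hyp_notifier *notify,
			   uint32_t query_id, struct hyp_result *out)
{
	unsigned int i;

	/* Not processed yet: the caller may busy-wait and retry. */
	if ((int32_t)(notify->latest_query_id - query_id) < 0)
		return false;

	for (i = 0; i < 8; i++) {
		if (notify->ring[i].query_id == query_id) {
			*out = notify->ring[i];
			return true;
		}
	}
	return false;	/* overwritten by a newer result */
}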

This buffer looks like this:


schema_notifer_bo

To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events like special counter modes and multiple passes I still had to improve it.

Currently, the connection between these software methods and perfmon is in a work in progress state. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my github account; you can take a look at them here. I also have an out-of-mesa example, initially written by Martin Peres, which shows how to use that first prototype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating. Will get back to you on them when I’m done experimenting with them.

Designing and implementing a kernel interface in an elegant way takes a while…

See you soon for the full implementation!


June 25, 2014
Firewalls

Fedora has had problems for a long while with the default firewall rules. They would make a lot of things not work (media and file sharing of various sorts, usually, whether as a client or a server) and users would usually disable the firewall altogether, or work around it through micro-management of opened ports.

We went through multiple discussions over the years trying to break the security folks' resolve on what should be allowed to be exposed on the local network (sometimes trying to get rid of the firewall). Or rather we tried to agree on a setup that would be implementable for desktop developers and usable for users, while still providing the amount of security and dependability that the security folks wanted.

The last round of discussions was more productive, and I posted the end plan on the Fedora Desktop mailing-list.

By Fedora 21, Fedora will have a firewall that's completely open for the user's applications (with better tracking of what applications do what once we have application sandboxing). This reflects how the firewall was used on the systems that the Fedora Workstation version targets. System services will still be blocked by default, except a select few such as ssh or mDNS, which might need some tightening.

But this change means that you'd be sharing your music through DLNA on the café's Wi-Fi right? Well, this is what this next change is here to avoid.

Per-network Sharing

To avoid showing your music in the café, or exposing your holiday photographs at work, we needed a way to restrict sharing to wireless networks where you'd already shared this data, and provide a way to avoid sharing in the future, should you change your mind.

Allan Day mocked up such controls in our Sharing panel which I diligently implemented. Personal File Sharing (through gnome-user-share and WebDAV), Media Sharing (through rygel and DLNA) and Screen Sharing (through vino and VNC) implement the same per-network sharing mechanism.

Make sure that your versions of gnome-settings-daemon (which implements the starting/stopping of services based on the network) and gnome-control-center match for this all to work. You'll also need the latest version of all 3 of the aforementioned sharing utilities.

(and it also works with wired network profiles :)



Lately at Collabora I have been working on helping Mozilla with the GTK+ 3 port of Firefox.

The problem

The issue we had to solve is that GTK+ 2 and GTK+ 3 cannot be loaded in the same address space. Moving Firefox from GTK+ 2 to GTK+ 3 isn’t a problem, as only GTK+ 3 gets loaded in its address space, and everything is fine. The problem comes when you load a plugin that links to GTK+ 2, e.g. Flash. Then, GTK+ 2 and GTK+ 3 get both loaded, GTK+ detects that, and aborts to avoid bigger problems. This was tracked as bug #624422.

More specifically, Firefox links to libxul.so, which in turn links to GTK+. These days, the plugins are loaded in a separate process, plugin-container, which communicates with the Firefox process through IPC. If plugin-container didn’t link to GTK+, there would be absolutely no problem, as the browser (Firefox) process could link to GTK+ 3 and plugin-container could load any plugin, including GTK+ 2 ones. However, although plugin-container doesn’t directly use GTK+, it links to libxul.so for IPC, which brings GTK+ into its address space.

The solution

In order to solve this, we evaluated various options. The first one was to split libxul.so in two parts, one with the IPC code and lower level stuff, which wouldn’t link to GTK+, and another with the rest of the code, including all the widget and toolkit integration, which would obviously link to GTK+. However this turned out not to be possible, as the libxul code was too intricate.

In the end, we decided to add a thin layer between libxul and GTK+, which we called libmozgtk.so. This small layer links to GTK+ 3, and provides stubs for GTK+ 2 specific symbols. Additionally, there is a libmozgtk2.so with SONAME “libmozgtk.so”, which links to GTK+ 2 and provides stubs for GTK+ 3 symbols. We made libxul link against libmozgtk.so, and so when Firefox runs, libxul.so, libmozgtk.so, and GTK+ 3 are loaded, and Firefox uses GTK+ 3. However when plugin-container is executed, we add LD_PRELOAD=libmozgtk2.so in the environment. Since libmozgtk2.so has a libmozgtk.so SONAME, the libxul.so dependency is satisfied, and the plugin-container process ends with GTK+ 2. Since plugin-container doesn’t make use of the GTK+ code in libxul, this is safe, and we end up with a GTK+ 3 Firefox that can load GTK+ 2 plugins. The end result is that you can watch Youtube videos again!

While this solution is somewhat hacky, it means we didn’t need to mess with libxul, splitting it in two just for the Linux/GTK+ port’s sake. And when the GTK+ 2 plugins become irrelevant, or NPAPI support is removed (as it recently happened in Chrome), we should be able to easily revert this and use GTK+ 3 everywhere.

Wayland

On an unrelated note, we have looked a bit at porting Firefox to Wayland. Wayland is designed to be a replacement for X11, and is becoming very popular in the digital TV and set top box space. Those obviously need HTML engines and web browsers, and with WebKit and Chrome already having Wayland ports, we think Firefox shouldn’t fall behind.

For this, the GTK+ 3 port was a prerequisite, but that isn’t enough. There are many X11 uses in the Firefox codebase, most of which are guarded by #ifdef MOZ_X11, though not all of them are. We got Firefox to start on Weston (the Wayland reference compositor) with a bunch of hacks, one of which broke keyboard input (but avoided a segfault). As you can see from the screenshot, things aren’t perfect, but it’s at least a good start!

Firefox running on Weston

June 23, 2014
This will, I think, be the first time I've blogged about something quite so retroactively, but for reasons which should be apparent, I could not blog about this little adventure until now.  This is the story of CVE-2014-0972 (QCIR-2014-00004-1), and (at least part of) how I was able to install fedora on my firetv:

Introduction..

Back in April, I bought myself a Fire TV, with the thought that it would make a nice fedora xbmc htpc setup, complete with open src drivers, to replace my aging pandaboard.  But, of course, as delivered the Fire TV is locked down with no root access.

At the same time, there was a feature of the downstream android kernel gpu driver (kgsl), per-context pagetables, which had been on my TODO list for the upstream drm/msm driver for a while now.  But, I needed to understand better what kgsl was doing and the interactions with the hardware, in particular the behaviour of the CP (command processor), in order to convince myself that such a feature was safe.  People generally frown on introducing root holes in the upstream kernel, and I didn't exactly have documentation about the hardware.  So it was time to roll up my sleeves and get some hands-on experience (translation: try to poke and crash the gpu in lots of different ways and try to make sense of the result).

Into the rabbit hole..

The modern snapdragon SoCs use IOMMUs everywhere.  Including the GPU.  To implement per-context gpu pagetables, basically all the driver needs to do is to bang a few IOMMU registers to change the pagetable base addr and invalidate the TLB.  But this must be done when you are sure the GPU is not still trying to access memory mapped in the old page tables.  Since a GPU is a highly asynchronous device, it would be a big performance hit to stall until GPU ringbuffer drains, then reprogram IOMMU, then resume the GPU with commands from the new context.  To avoid this performance hit, kgsl maps some of the IOMMU registers into the GPU's virtual address space, and emits commands into the ringbuffer for the CP to write the necessary registers to switch pagetables and invalidate TLB.

It was this reprogramming of IOMMU from the GPU itself which I needed to understand better.  Anyone who understands GPU's would have the initial reaction that this is extremely dangerous.  But kgsl was, it seemed, taking some protections.  However, I needed to be sure I properly understood how this worked, to see if there was something that was overlooked.

The GPU, in fact, has two hw contexts which it can switch between.  Essentially it is in some ways similar to supervisor vs user context on a CPU.  The way kgsl uses this is to map the IOMMU registers into the supervisor context, but not user contexts.  The ringbuffer is mapped into all the user contexts, plus supervisor context, at the same device virtual address.  The idea being that if the ringbuffer is mapped in the same position in all contexts, you can safely context switch from commands in the ringbuffer.

To do this, kgsl emits commands for the CP to write a special bit in CP_STATE_DEBUG_INDEX to switch to the "supervisor" context.  Then commands to write IOMMU registers, followed by write to CP_STATE_DEBUG_INDEX to switch back to user context.  (I'm over-simplifying slightly, as there are some barriers needed to account for asynchronous writes.)  But userspace constructed commands never execute from the ringbuffer, instead the kernel puts an IB (indirect branch) into the ringbuffer to jump to the userspace constructed cmdstream buffer.  This userspace cmdstream buffer is never mapped into supervisor context, or into other user's contexts.  So in theory, if userspace tried to write CP_STATE_DEBUG_INDEX to switch to supervisor mode (and gain access to the IOMMU registers), the GPU would immediately page fault, since the cmdstream it was in the middle of executing is no longer mapped.  Ok, so far, so good.

Where it breaks down..

From my attempts at switching to supervisor mode from IB1, and deciphering the fault address where the gpu crashed, and iommu register dumps, I could tell that the next few commands after the switch to supervisor mode were executed without problem.. there is some prefetch/pipelining!

But much more conveniently, while poking around, I realized that there were a couple pages mapped globally (in supervisor and all user contexts), which were mapped writable in user contexts.  I used the so called "setstate" buffer.  So I simply had to construct a cmdstream buffer to write the commands I wanted to execute into the setstate buffer, and then do an IB to that buffer and do the supervisor switch in IB2.

Ok.. but to do anything useful with this, I'd need a reasonable chunk of physically contiguous pages, at a known physical address.. in particular 16K for first level pagetables and 16K for second level pagetables.  Fortunately ION comes to the rescue here, with its physically contiguous carveouts at known physical addresses.  In this case, allocate from the multimedia pool when there is no video playback, etc, going on.  This way ION allocates from the beginning of the carveout pool, a known address.

Into this buffer, construct a new set of pagetables which map whatever physical address you want to read/write (hint: any of kernel lowmem), plus a replacement page for the setstate buffer (since we don't know the original setstate buffer's physical address.. which means we actually have two copies of the commands copied into the setstate buffer: one copied via the gpu to the original setstate page, and one written directly by the cpu into the replacement setstate page).


The proof of concept that I made simply copied the string "Kilroy was here" into a kernel buffer.  But quite easily any random app downloaded from an untrusted source could access any memory, become root, etc.  Not the sort of thing you want falling into the wrong hands.

Once I managed to prove to myself that I understood properly how the hw was working, I wrote up a short report, and submitted it (plus proof of concept) to the qualcomm security team.

Now that the vulnerability is no longer embargoed, I've made available the proof of concept and report here.

Originally I planned to (once fixes were pushed out, so as to not put someone who did not intend to root their device at risk) release a jailbreak based on this vulnerability.  But once towelroot was released, there was no longer a need for me to turn this into an actual firetv jailbreak.  Which saves me from having to figure out how to make an apk.

Parting thoughts..

  1. Well, knowledge about physical addresses and contiguous memory in userspace, while it might not be a security problem in and of itself, sure helps turn other theoretical exploits into actual exploits.
  2. As far as downstream vendor drivers go, the kgsl driver is actually pretty decent, in terms of code quality, etc.  I've seen far worse.  Admittedly this was not a trivial hole.  But imagine what issues lurk in other downstream gpu/camera/video/etc drivers.  Security is often not simple, and I really doubt whether the other downstream drivers are getting a critical look (from good-guys who will report the issue responsibly).
  3. I used to think of the whole one-kernel-branch-per-device wild-west ways of android as a bit of a headache.  Now I realize it is a security nightmare.  An important part of platform security is being able to react quickly when (not if) vulnerabilities are found.  In the desktop/server world, CVEs are usually not embargoed for more than a week.. that is all you need, since fortunately we don't need a different kernel for each different make and model of server, laptop, etc.  In the mobile device world, it is quite a different story!

June 22, 2014
It's been a week now, and I've made surprising amounts of progress on the project.

I came in with this giant task list I'd been jotting down in Workflowy (Thanks for the emphatic recommendation of that, Qiaochu!). Each of the tasks I had were things where I'd have been perfectly unsurprised if they'd taken a week or two. Instead, I've knocked out about 5 of them, and by Friday I had phire's "hackdriver" triangle code running on a kernel with a relocations-based GEM interface. Oh, sure, the code's full of XXX comments, insecure, and synchronous, but again, a single triangle rendering in a month would have been OK with me.

I've been incredibly lucky, really -- I think I had reasonable expectations given my knowledge going in. One of the ways I'm lucky is that my new group is extremely helpful. Some of it is things like "oh, just go talk to Dom about how to set up your serial console" (turns out minicom fails hard, use gtkterm instead. Also, someone else will hand you a cable instead of having to order one, and Derek will solder you a connector. Also, we hid your precious dmesg from the console after boot, sorry), but it extends to "Let's go have a chat with Tim about how to get modesetting up and running fast." (We came up with a plan that involves understanding what the firmware does with the code I had written already, and basically whacking a register beyond that. More importantly, they handed me a git tree full of sample code for doing real modesetting, whenever I'm ready.).

But I'm also lucky that there's been this community of outsiders reverse engineering the hardware. It meant that I had this sample "hackdriver" code for drawing a triangle with the hardware entirely from userspace, that I could incrementally modify to sit on top of more and more kernel code. Each step of the way I got to just debug that one step to go from "does not render a triangle" back to "renders that one triangle." (Note: When a bug in your command validator results in pointing the framebuffer at physical address 0 and storing the clear color to it, the computer will go away and stop talking to you. Related note: When a bug in your command validator results in reading your triangle from physical address 0, you don't get a triangle. It's like I need a command validator for my command validator.).

https://github.com/anholt/linux/tree/vc4 is the code I've published so far. Starting Thursday night I've been hacking together the gallium driver. I haven't put it up yet because 1) it doesn't even initialize, but more importantly 2) I've been using freedreno as my main reference, and I need to update copyrights instead of just having my boilerplate at the top of everything. But next week I hope to be incrementally deleting parts of hackdriver's triangle code and replacing it with actual driver code.
June 20, 2014

NVIDIA NVPerfKit is a suite of performance tools to help developers identify performance bottlenecks in OpenGL and Direct3D applications. It allows you to monitor hardware performance counters, which are used to store the counts of hardware-related activities on the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to identify bottlenecks in their applications, like “how busy is the gpu?” or “how many triangles have been drawn in the current frame?” and so on. But NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counters to help Linux/Nouveau developers improve their OpenGL applications. At the end of this summer, this project aims to offer a Linux version of NVPerfkit for NVIDIA’s graphics cards (only GeForce 8, 9 and 2XX at first).  To expose these hardware events to userspace, we have to write an interface between the Linux kernel and mesa. Basically, the idea is to tell the kernel to monitor signal X and read back the results from userspace (i.e. mesa). However, before writing that interface we have to study the behaviour of NVPerfKit on Windows.

First, let me explain (again) what a hardware performance counter really is. A hardware performance counter is a set of special registers used to count hardware-related activities. There are two types of counters: global counters from PCOUNTER and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters whereas MP counters are per-app and context switched. Actually, these two types of counters are not really independent and may share some configuration parts, for example, the output of a signal multiplexer. On Tesla/nv50, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we are only focusing on global counters. Now, the question is: how does NVPerfKit monitor these global performance counters?

Case #1: How does NVPerfKit handle multiple apps being monitored concurrently?

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more than one application is monitoring performance counters at the same time. Because of the issue of shared configuration between global counters (PCOUNTER) and local counters (MP counters), I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, using a global lock to allow only one application at a time and to simplify the implementation.

Case #2: How does NVPerfKit handle only one counter per domain?

This is the simplest case, and there are no particular requirements.

Case #3: How does NVPerfKit handle multiple counters per domain?

NVPerfKit uses a round robin mode: it still monitors only one counter per domain, and it switches the current counter after each frame.

Case #4: How does NVPerfKit handle multiple counters on different domains?

No problem here, NVPerfKit is able to monitor multiple counters on different domains (each domain having up to one event to monitor).

To sum up, NVPerfKit always uses a round robin mode when it has to monitor more than one hw event on the same domain.

Concerning the sampling part, NVIDIA says (NVPerfKit User Guide – page 11 – Appendix B. Counters reference):

All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.

This article should have been published last month, but during that time I worked on the prototype’s definition and its implementation. Currently, I have a first prototype which works quite well; I’ll submit it next week.

See you next week!


June 18, 2014
bartholomea-annulata

Bartholomea annulata | (c) Kevin Bryant

It is time for a new Tanglu update, which has been overdue for a long time now!

Many things happened in Tanglu development, so here is just a short overview of what was done in the past months.

Infrastructure

Debile

The whole Tanglu distribution is now built with Debile, replacing Jenkins, which was difficult to use for package building purposes (although Jenkins is great for other things). You can see the Tanglu builders in action at buildd.tg.o.

The migration to Debile took a lot of time (a lot more than expected), and blocked the Bartholomea development at the beginning, but now it is working smoothly. Many thanks to all people who have been involved with making Debile work for Tanglu, especially Jon Severinsson. And of course many thanks to the Debile developers for helping with the integration, Sylvestre Ledru and of course Paul Tagliamonte.

Archive Server Migration

Those who read the tanglu-announce mailinglist know this already: We moved the main archive server stuff at archive.tg.o to a new location, and to a very powerful machine. We also added some additional security measures to it, to prevent attacks.

The previous machine is now being used for the bugtracker at bugs.tg.o and for some other things, including an archive mirror and the new Tanglu User Forums. See more about that below :-)

Transitions

There is huge ongoing work on package transitions. Take a look at our transition tracker and the staging migration log to get a taste of it.

Merging with Debian Unstable is also going on right now, and we are working on merging some of the Tanglu changes which are useful for Debian as well (or which just reduce the diff to Tanglu) back to their upstream packages.

Installer

Work on the Tanglu Live-Installer, although badly needed, has not yet been started (it’s a task ready for taking by anyone who likes to do it!) – however, some awesome progress has been made in making the Debian-Installer work for Tanglu, which allows us to perform minimal installations of the Tanglu base system and allows easier support of alternative Tanglu flavours. The work on d-i also uncovered a bug which appeared with the latest version of findutils, which has been reported upstream before Debian could run into it. This awesome progress was possible thanks to the work of Philip Muškovac and Thomas Funk (in really hard debug sessions).

Tanglu Users Forum

We finally have the long-awaited Tanglu user forums ready! As discussed in the last meeting, a popular demand on IRC and our mailing lists was a forum or Stackexchange-like service for users to communicate, since many people can work better with that than with mailinglists.

Therefore, the new English TangluUsers forum is now ready at TangluUsers.org. The forum software is in an alpha version though, so we might experience some bugs which haven’t been uncovered in the testing period. We will watch how the software performs and then decide if we stick to it or maybe switch to another one. But so far, we are really happy with the Misago Forums, and our usage of it already led to the inclusion of some patches against Misago. It also is actively maintained and has an active community.

Misc Things

KDE

We will ship with at least KDE Applications 4.13, maybe some 4.14 things as well (if we are lucky, since Tanglu will likely be in feature-freeze when this stuff is released). The other KDE parts will remain on their latest version from the 4.x series. For Tanglu 3, we might update KDE SC 4.x to KDE Frameworks 5 and use Plasma 5 though.

GNOME

Due to the lack of manpower on the GNOME flavor, GNOME will ship in the same version available in Debian Sid – maybe with some stuff pulled from Experimental, where it makes sense. A GNOME flavor is planned to be available.

Common infrastructure

We currently run with systemd 208, but a switch to 210 is planned. Tanglu 2 also targets the X.org server in version 1.16. For more changes, stay tuned. The kernel release for Bartholomea is also not yet decided.

Artwork

Work on the default Tanglu 2 design has started as well – any artwork submissions are most welcome!

Tanglu joins the OIN

The Tanglu project is now a proud member (licensee) of the Open Invention Network (OIN), which builds a pool of defensive patents to protect the Linux ecosystem from companies who are trying to use patents against Linux. Although the Tanglu community does not fully support the generally positive stance the OIN has towards software patents, the OIN effort is very useful and we agree with its goal. Therefore, Tanglu joined the OIN as a licensee.


And that’s the stuff for now! If you have further questions, just join us on #tanglu or #tanglu-devel on Freenode, or write to our newly created forum! – You can, as always, also subscribe to our mailinglists to get in touch.

June 17, 2014

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe that if you are interested in more frequent technical updates on what we are working on.)

In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking advantage of the /usr merge that a number of distributions have completed, we want to bring the runtime behaviour of Linux systems to the next level. With the /usr merge completed, most static vendor-supplied OS data is found exclusively in /usr; only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
  2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as a factory reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP or are capable of discovering their parameters automatically from the available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particularly interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.

Concepts

A number of Linux-based operating systems have tried to implement some of the schemes described above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind, and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

  1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
  2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will call these stateless systems.)

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like the first described mode. Note that a mode where /etc is flushed but /var is not is not something we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).

Problems

Booting up a system without a populated /var is relatively straight-forward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, a lot of software does not. In cases like this it is necessary to ship a couple of additional tmpfiles lines that set up at boot time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.
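
To give an idea what such tmpfiles lines look like, here is a small, purely illustrative snippet (the directories and modes below are examples picked for this post, not the exact lines systemd ships):

# /usr/lib/tmpfiles.d/var-example.conf (illustrative)
# Type  Path        Mode  UID   GID   Age
d       /var/cache  0755  root  root  -
d       /var/lib    0755  root  root  -
d       /var/log    0755  root  root  -
d       /var/spool  0755  root  root  -
d       /var/tmp    1777  root  root  30d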

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var, there needs to be a way for these directories to be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on the next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

  • A new tool systemd-sysusers has been added. It introduces a new drop-in directory /usr/lib/sysusers.d/. Minimal descriptions of necessary system users and groups can be placed there. Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.) The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative -- which is how system users are traditionally created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.

    To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.

    Some OS designs use a static, fixed user/group list stored in /usr as the primary database for users/groups, with fixed UID/GID mappings. While this works for specific systems, this cannot cover the general purpose. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made of this space and only UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.

    Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.

  • We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been updated since the last boot. This is implemented based on the mtime timestamp of /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary. (A minimal unit file sketch using this condition is shown after this list.)
  • We added a number of service files, that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementioned systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc/ld.so.cache.
  • If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.
  • We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability to copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc. Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, that always contains the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as this provides administrators with an easy place to diff their own configuration in /etc against to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:

    # systemd-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd, invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that combines this into simple switches reusing the vocabulary introduced earlier.
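
Here is the minimal unit file sketch mentioned above; the unit name and command are made up for illustration, only the ConditionNeedsUpdate= line is the actual mechanism:

# /usr/lib/systemd/system/example-rebuild-cache.service (hypothetical)
[Unit]
Description=Rebuild example cache after offline /usr updates
ConditionNeedsUpdate=/etc

[Service]
Type=oneshot
ExecStart=/usr/bin/example-rebuild-cache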

What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incompatible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information into /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (if only because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

  • When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped privileges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.
  • When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.
  • When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.) A small illustrative snippet follows right after this list.
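
Such a sysusers.d file could look roughly like this (the package, user and group names are made up; the column layout follows my reading of the sysusers.d format):

# /usr/lib/sysusers.d/foobar.conf (hypothetical)
# Type  Name     ID  GECOS
u       foobard  -   "Foobar daemon user"
g       foobar   -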

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.
  • Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.

Conclusion

Before we finish, let me stress again why we are doing all this:

  1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
  2. For embedded machines we want a generic way to reset devices. We also want a way for every single boot to be identical to a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
  7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all the discussions on how /usr was to be organized this was frequently mentioned. A setup like this so far only worked in very specific cases; with this scheme we want to make it work in the general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Let's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.

Yesterday was my first day working at Broadcom. I've taken on a new role as an open source developer there. I'm going to be working on building an MIT-licensed Mesa and kernel DRM driver for the 2708 (aka the 2835), the chip that's in the Raspberry Pi.

It's going to be a long process. What I have to work with to start is basically sample code. Talking to the engineers who wrote the code drops we've seen released from Broadcom so far, they're happy to tell me about the clever things they did (their IR is pretty cool for the target subset of their architecture they chose, and it makes instruction scheduling and register allocation *really* easy), but I've had universal encouragement so far to throw it all away and start over.

So far, I'm just beginning. I'm still working on getting a useful development environment set up and building my first bits of stub DRM code. There are a lot of open questions still as to how we'll manage the transition from having most of the graphics hardware communication managed by the VPU to having it run on the ARM (since the VPU code is a firmware blob currently, we have to be careful to figure out when it will stomp on various bits of hardware as I incrementally take over things that used to be its job).

I'll have repos up as soon as I have some code that does anything.

Overview

Pictures are the right way to start.

appgtt_concept

Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

  1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by the hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs; Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set of page tables that is maintained to have the identical mappings as the global GTT (the picture above). There is more on this in the Summary section.

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straight forward. The choice is provided in several GPU commands. If there is no explicit choice, then there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or the Aliasing PPGTT1. Several commands as well, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.
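
As a rough sketch of what that looks like in the driver, here is roughly the Haswell dispatch path in i915 at the time of writing (macro names from memory of i915_reg.h, so double-check against your tree):

/* Emit MI_BATCH_BUFFER_START, selecting the PPGTT for unprivileged batches. */
ret = intel_ring_begin(ring, 2);
if (ret)
	return ret;

intel_ring_emit(ring,
		MI_BATCH_BUFFER_START |
		(flags & I915_DISPATCH_SECURE ?
		 0 : MI_BATCH_PPGTT_HSW | MI_BATCH_NON_SECURE_HSW));
intel_ring_emit(ring, offset); /* GPU virtual address of the batch */
intel_ring_advance(ring);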

Architecture

The names for all the page table data structures in hardware are the same as for the IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs too much, and I won’t copy the diagrams (I’m probably not allowed to anyway). However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table.  There are several good diagrams in the PRMs2 which I won’t bother redrawing.

Page Directory Entry
  31:12  Physical Page Address 31:12
  11:04  Physical Page Address 39:32
  03:02  Rsvd
  01     Page size (4K/32K)
  00     Valid

Page Table Entry
  31:12  Physical Page Address 31:12
  11     Cacheability Control[3]
  10:04  Physical Page Address 38:32
  03:01  Cacheability Control[2:0]
  00     Valid

There are some things we can get from this for those too lazy to click on the links to the docs.

  1. PPGTT page tables exist in physical memory.
  2. PPGTT PTEs have the exact same layout as GGTT PTEs.
  3. PDEs don’t have cache attributes (more on this later).
  4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

  • A PDE is 4 bytes wide
  • A PTE is 4 bytes wide
  • A Page table occupies 4k of memory.
  • There are 4k/4 = 1024 entries in a page table.

With all this information, I now present you a slightly more accurate picture.

real_appgtt

An object with an aliased PPGTT mapping

Size

PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)
  63:32  MBZ
  31:0   PPGTT Directory Cache Restore [1..32] 16 entries

DCLV, the right way
  31   PDE[511:496] enable
  30   PDE[495:480] enable
  ...
  1    PDE[31:16] enable
  0    PDE[15:0] enable

The “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes, and there are 64 bytes in a cacheline, so 64/4 = 16 entries per bit.  We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB

Location

PP_DIR_BASE: Sadly, I cannot find the definition of this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs (yay me). There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.

Programming

With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

...
ret = intel_ring_begin(ring, 6);
if (ret)
	return ret;

intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(2));
intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);
intel_ring_advance(ring);
...

As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about the commands execution in the GPU’s ringbuffer. If it’s easier just pretend it’s 2 MMIO writes.

Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
					  &ppgtt->node, GEN6_PD_SIZE,
					  GEN6_PD_ALIGN, 0,
					  0, dev_priv->gtt.base.total,
					  DRM_MM_TOPDOWN);

2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
	if (!ppgtt->pt_pages[i]) {
		gen6_ppgtt_free(ppgtt);
		return -ENOMEM;
	}
}

3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {
	dma_addr_t pt_addr;

	pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
			       PCI_DMA_BIDIRECTIONAL);
	...
}

As the system binds and unbinds objects in the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.
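
A simplified sketch of that binding path, loosely modelled on gen6_ppgtt_insert_entries() in i915_gem_gtt.c (treat it as illustration rather than the exact upstream code):

/* Walk the object's backing pages and write one PTE per page, moving on
 * to the next page table whenever the current one fills up. The PDEs
 * are never touched here. */
unsigned act_pt  = first_entry / I915_PPGTT_PT_ENTRIES;
unsigned act_pte = first_entry % I915_PPGTT_PT_ENTRIES;
gen6_gtt_pte_t *pt_vaddr = kmap_atomic(ppgtt->pt_pages[act_pt]);
struct sg_page_iter sg_iter;

for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
	pt_vaddr[act_pte] =
		vm->pte_encode(sg_page_iter_dma_address(&sg_iter),
			       cache_level, true);
	if (++act_pte == I915_PPGTT_PT_ENTRIES) {
		kunmap_atomic(pt_vaddr);
		pt_vaddr = kmap_atomic(ppgtt->pt_pages[++act_pt]);
		act_pte = 0;
	}
}
kunmap_atomic(pt_vaddr);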

IOMMU

As we saw in step 3 above, I mention that the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the dma address. Deferring to wikipedia again for the description of an IOMMU, that’s all. It tripped me up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register.  Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
	ecochk |= ECOCHK_PPGTT_WB_HSW;
} else {
	ecochk |= ECOCHK_PPGTT_LLC_IVB;
	ecochk &= ~ECOCHK_PPGTT_GFDT_IVB;
}
I915_WRITE(GAM_ECOCHK, ecochk);

What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

ggtt_flow

Flow chart for GPU GGTT memory access. Red means slow.

ppgtt_flow

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.

 

Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable of using a per-process address space, or where it is undesirable to do so.

A brief description of why, with all the current callers of the global pin interface.

  • Display: Display actually implements its own version of the GGTT. Maintaining the logic to support multiple-level page tables was both costly, and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. I expect this to be true, forever.
    • i915_gem_object_pin_to_display_plane(): page flipping
    • intel_setup_overlay(): overlays
  • Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a nightmare to synchronize if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
    • allocate_ring_buffer()
  • HW Contexts: Extremely similar to ringbuffer.
    • intel_alloc_context_page(): Ironlake RC6
    • i915_gem_create_context(): Create the default HW context
    • i915_gem_context_reset(): Re-pin the default HW context
    • do_switch(): Pin the logical context we’re switching to
  • Hardware status page: The use of this, prior to execlists, is much like ringbuffers, and contexts. There is a per process status page with execlists.
    • init_status_page()
  • Workarounds:
    • init_pipe_control(): Initialize scratch space for workarounds.
    • intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
    • render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
  • Other
    • i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
    • i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
    • i915_gem_pin_ioctl(): The DRI1 pin interface.

GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

  • PTE size increased to 8b
    • Therefore, 512 entries per table
    • Format mimics the CPU PTEs
  • PDEs increased to 8b (remains 512 PDEs per PD)
    • Page Directories live in system memory
      • GGTT no longer holds the PDEs.
    • There are 4 PDPs, and therefore 4 PDs
    • PDEs are cached in LLC instead of special cache (I’m guessing)
  • New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
    • PP_DIR_BASE, and PP_DCLV are removed
  • Support for 4 level page tables, up to 48b virtual address space.
    • PML4[PML4E]->PDP
    • PDP[PDPE] -> PD
    • PD[PDE] -> PT
    • PT[PTE] -> Memory
  • Big pages are now 64k instead of 32k (still not implemented)
  • New caching interface via PAT like structure

Summary

There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about everything mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT are disjoint5. The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

  • The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
  • Aliasing PPGTT was designed as a drop-in performance replacement for the GGTT.
  • GEN8 changed a lot of architectural stuff.
  • The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.
https://bwidawsk.net/blog/wp-content/uploads/2014/06/appgtt_concept.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/real_ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ggtt_flow.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/ppgtt_flow.svg

Download PDF

  1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT 

  2. Page walk, Two-Level Per-Process Virtual Memory 

  3. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority. 

  4. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches 

  5. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry( 

June 11, 2014
So over the past few years the drm subsystem gained some very nice documentation. And recently we've started to follow suit with the Intel graphics driver. All the kernel documentation is integrated into one big DocBook and I regularly upload latest HTML builds of the Linux DRM Developer's Guide. This is built from drm-intel-nightly so has slightly fresher documentation (hopefully) than the usual DocBook builds from Linus' main branch which can be found all over the place. If you want to build these yourself simply run

$ make htmldocs

For testing we now also have neat documentation for the infrastructure and helper libraries found in intel-gpu-tools. The README in the i-g-t repository has detailed build instructions - gtkdoc is a bit more of a fuss to integrate.

Below the break some more details about documentation requirements relevant for developers.

So from now on I expect reasonable documentation for new, big kernel features and for new additions to the i-g-t library.

For i-g-t the process is simple: Add the gtk-doc comment blocks to all newly added functions, install and build with gtk-doc enabled. Done. If the new library is tricky (for example the pipe CRC support code) a short overview section that references some functions to get people started is useful, but not really required. And with the exception of the still in-flux kernel modesetting helper library i-g-t is fully documented, so there's lots of examples to copy from.
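
For reference, a gtk-doc comment block in i-g-t looks roughly like this (the helper below is made up, purely to show the shape of the markup):

/* Hypothetical helper, shown only to illustrate the gtk-doc markup. */
/**
 * igt_example_do_something:
 * @fd: open DRM file descriptor
 * @flags: behaviour flags
 *
 * One or two sentences describing what the helper does; this ends up in
 * the generated reference documentation.
 *
 * Returns: 0 on success, a negative error code on failure.
 */
int igt_example_do_something(int fd, unsigned int flags);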

For the kernel this is a bit more involved, mostly since kerneldoc sucks more. But we also only just started with documenting the drm/i915 driver itself.
  1. First extract all the code for your new feature into a new file. There's unfortunately no other way to sensibly split up and group the reference documentation with kerneldoc. But at least that will also be a good excuse to review the related interfaces before extracting them.
  2. Create reference kerneldoc comments for the functions used as interfaces to the rest of the driver. It's always a bit of a judgement call what to document and what not, since compared to the DRM core where functions must be explicitly exported to drivers there's no clean separation between the core parts and subsystems and more mundane platform enabling code. For big and complicated features it's also good practice to have an overview DOC: section somewhere at the beginning of the file.
  3. Note that kerneldoc doesn't have support for markdown syntax (or anything else like that) and doesn't do automatic cross-referencing like gtk-doc. So if your documentation absolutely needs a table or a list you have to do it twice unfortunately: Once as a plain code comment and once as a DocBook marked-up table or list. Long-term we want to improve the kerneldoc markup support, but for now we have to deal with what we have.
  4. As with all documentation don't document the details of the implementation - otherwise it will get stale fast because comments are often overlooked when updating code.
  5. Integrate the new kerneldoc section into the overall DRM DocBook template. Note that you can't go deeper than a section2 nesting, for otherwise the reference documentation won't be listed, and due to the lack of any autogenerated cross-links will be inaccessible and useless. Build the html docs to check that your overview summary and reference sections have all been pulled in and that the kerneldoc parser is happy with your comments.
A really nice example for how to do this all is the documentation for the gen7 cmd parser in i915_cmd_parser.c.
June 10, 2014

videotape

Introduction

Gobi chipsets are mobile broadband modems developed by Qualcomm, and they are nowadays used by lots of different manufacturers, including Sierra Wireless, ZTE, Huawei and of course Qualcomm themselves.

These devices will usually expose several interfaces in the USB layer, and each interface will then be published to userspace as different ‘ports’ (not the correct name, but I guess easier to understand). Some of the interfaces will give access to serial ports (e.g. ttys) in the modem, which will let users execute standard connection procedures using the AT protocol and a PPP session. The main problem with using a PPP session over a serial port is that it makes it very difficult, if not totally impossible, to handle data rates above 3G, like LTE. So, in addition to these serial ports, Gobi modems also provide access to a control port (speaking the QMI protocol) and a network interface (think of it as a standard ethernet interface). The connection procedure then can be executed purely through QMI (e.g. providing APN, authentication…) and then userspace can use a much more convenient network interface for the real data communication.

For a long time, the only way to use such a QMI+net pair in the Linux kernel was to use the out-of-tree GobiNet drivers provided by Qualcomm or by other manufacturers, along with user-space tools also developed by them (some of them free/open, some of them proprietary). Luckily, a couple of years ago a new qmi_wwan driver was developed by Bjørn Mork and merged into the upstream kernel. This new driver provided access to both the QMI port and the network interface, but was much simpler than the original GobiNet one. The scope was reduced so much that most of the work that the GobiNet driver was doing in kernel-space now had to be done by userspace applications. There are now at least 3 different user-space implementations allowing the use of QMI devices through qmi_wwan: ofono, uqmi and of course, libqmi.

The question, though, still remains. What should I use? The upstream qmi_wwan kernel driver and user-space utilities like libqmi? Or rather, the out-of-tree GobiNet driver and user-space utilities provided by manufacturers? I’m probably totally biased, but I’ll try to compare the two approaches by pointing out their main differences.

Note: you may want to read the ‘Introduction to libqmi‘ post I wrote a while ago first.

in-tree vs out-of-tree

The qmi_wwan driver is maintained within the upstream Linux kernel (in-tree). This, alone, is a huge advantage compared to GobiNet. Kernel updates may modify the internal interfaces they expose for the different drivers, and being within the same sources as all the other ones, the qmi_wwan driver will also get those updates without further effort. Whenever you install a kernel, you know you’ll have the qmi_wwan driver applicable to that same kernel version ready, so its use is very straightforward. The qmi_wwan driver also contains support for Gobi-based devices from all vendors, so regardless of whether you have a Sierra Wireless modem or a Huawei one (just to name a few), the driver will be able to make your device work as expected in the kernel.

GobiNet is a whole different story. There is not just one GobiNet: each manufacturer keeps its own. If you’re using a Sierra Wireless device you’ll likely want to use the GobiNet driver maintained by them, so that for example, the specific VID/PID pairs are already included in the driver; or going a bit deeper, so that the driver knows which is supposed to be the QMI/WWAN interface number that should be used (different vendors have different USB interface layouts). In addition to the problem of requiring to look for the GobiNet driver most suitable for your device, having the drivers maintained out-of-tree means that they need to provide a single set of sources for a very long set of kernel versions. The sources, therefore, are full of #ifdefs enabling/disabling different code paths depending on the kernel version targeted, so maintaining it gets to be much more complicated than if they just had it in-tree.

Note: Interestingly, we’ve already seen fixes that were first implemented in qmi_wwan ‘ported’ to GobiNet variants.

Complexity

The qmi_wwan driver is simple; it will just get a USB interface and split it into a QMI-capable /dev/cdc-wdm port (through the cdc-wdm driver) and a wwan network interface. As the kernel only provides basic transport to and from the device, it is left to user-space to manage the QMI protocol completely, including service client allocations/releases as well as the whole internal CTL service. Note, though, that this is not a problem; user-space tools like libqmi will do this work nicely.
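
As an illustration of how little is needed on top of that, once qmi_wwan and cdc-wdm have bound to the device, a simple query through the control port can be done with qmicli (the device path is just an example; qmicli --help lists the available operations):

$ sudo qmicli -d /dev/cdc-wdm0 --dms-get-manufacturer
$ sudo qmicli -d /dev/cdc-wdm0 --dms-get-model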

The GobiNet driver is instead very complex. The driver also exposes a control interface (e.g. /dev/qcqmi) and a network interface, but all the work that is done through the internal CTL service is done at kernel-level. So all client allocations/releases for the different services are actually performed internally, not exposed to user-space. Users will just be able to request client allocations via ioctl() calls, and client releases will be automatically managed within the kernel. In general, it is never advisable to have such a complex driver. As the complexity of a driver increases, so does the likelihood of having errors, and crashes in a driver could affect the whole kernel. Quoting Bjørn, the smaller the device driver is, the more robust the system is.

Note: Some Android devices also support QMI-capable chipsets through GobiNet (everything hidden in the kernel and the RIL). In this case, though, you may see that shared memory can also be used to talk to the QMI device, instead of a /dev/qcqmi port.

Device initialization

One of the first tasks that is done while communicating with the Gobi device is to set it up (e.g. decide which link-layer protocol to use in the network interface) and make sure that the modem is ready to talk QMI. In the case of the GobiNet driver, this is all done in kernel-space; while in the case of qmi_wwan everything can be managed in user-space. The libqmi library allows several actions to be performed during device initialization, including the setting of the link-layer protocol to use. There are, for example, models from Sierra Wireless (like the new MC7305) which expose by default one QMI+network interface (#8) configured to use 802.3 (ethernet headers) and another QMI+network interface (#10) configured to use raw IP (no ethernet headers). With libqmi, we can switch the second one to use 802.3, which is what qmi_wwan expects, thus allowing us to use both QMI+net pairs at the same time.

Multiple processes talking QMI

One of the problems of qmi_wwan is that only one process can use the control port at a given time. The GobiNet driver, instead, allows multiple processes to access the device concurrently, as each process gets assigned different QMI clients with different client IDs directly from the kernel, hence not interfering with each other. In order to handle this issue, libqmi (since version 1.8) was extended with a 'qmi-proxy' process which is the only one accessing the QMI port, but which lets different processes communicate with the device concurrently (by sharing and synchronizing the CTL service among the connected peers).

User-space libraries

The GobiNet driver is designed to be used along with Qualcomm's C++ GobiAPI library in user-space. On top of this library, other manufacturers (like Sierra Wireless) provide additional libraries to use specific features of their devices. The GobiAPI library itself handles all the ioctl() calls required to e.g. allocate new clients, and also provides a high-level API to access the different QMI services and operations in the device.

In the case of the qmi_wwan driver, as already said, there are several implementations which will let you talk QMI with the device. libqmi, which I maintain, is one of them. libqmi provides a GLib-based C library, exposing objects and interfaces which give access to the most commonly used QMI services in any kind of device. The CTL service, the internal one which GobiNet manages in the kernel, is handled internally by libqmi and therefore mostly hidden from the users of the library.

Note: It is not (yet) possible to mix GobiAPI with qmi_wwan and e.g. libqmi with GobiNet. Therefore, it is not (yet) possible to use libqmi or qmicli in e.g. an Android device with a QMI-capable chipset.

User-space command line tools

I am not really aware of any general-purpose command line tool developed to be used with the GobiNet driver (well, firmware loader applications, but those are not general purpose). The lack of command line tools is likely due to the fact that, as QMI clients are released automatically by the GobiNet kernel driver, it is not easy (if at all possible) to leave a QMI client allocated and re-use it over and over by a command line tool which executes an action and exits.

With qmi_wwan, though, as clients are not automatically released, command line tools are much easier to handle. The libqmi project includes a qmicli tool which is able to execute independent QMI requests in each run of the program, even re-using the same QMI client in each of the runs if needed. This is especially important when launching a connection, as the WDS client which executes the “Start Network” command must be kept registered as long as the connection is open, or otherwise the connection will get dropped.
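
To make this concrete, here is roughly what a qmicli-based session could look like. This is just an illustrative sketch: the device path, the APN and the client ID are placeholders, and the exact option names may differ between libqmi versions.

# one-shot query: a client is allocated, used and released within a single run
$ qmicli -d /dev/cdc-wdm0 --dms-get-manufacturer

# start a connection and keep the WDS client allocated after qmicli exits...
$ qmicli -d /dev/cdc-wdm0 --wds-start-network=internet --client-no-release-cid

# ...then reuse that same client ID (as reported by the previous run) later on
$ qmicli -d /dev/cdc-wdm0 --wds-get-packet-service-status --client-cid=4

# with the qmi-proxy ('-p'), several processes can share the port concurrently
$ qmicli -p -d /dev/cdc-wdm0 --nas-get-signal-strength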

New firmware loading

The process of loading new firmware into a QMI-based device is not straightforward. It involves several interactions at QMI level, plus a QDL-based download of the firmware to the device (kind of what gobi_loader does for Gobi 2K). Sadly, there is not yet a way to perform this operation when using qmi_wwan and its user-space tools. If you need to update the firmware of the device, the only choice left is to use the GobiNet driver plus the vendor-provided programs.

Support

One of the advantages of the GobiNet driver is that every manufacturer will (or should) give direct support for their devices if that kernel driver is used. Actually, there are vendors which will only support the hardware if their driver is the one in use. I'm therefore assuming that GobiNet may be a good choice for companies that want to rely on the vendor-provided support, but likely not for regular users who just happen to have a device of this kind in their systems.

But even without that official support, you can still get in touch with the libqmi mailing list if you're experiencing issues with your QMI device, or contact companies or individuals (e.g. me!) who provide commercial support for the qmi_wwan driver and libqmi/qmicli integration needs.



Two months ago, in April '14, I was in San Francisco to meet with other FOSS developers and discuss current projects. There were several events, including the first GNOME Westcoast Summit and a systemd hackfest at Pantheon. I've been working on a lot of stuff lately and it was nice to talk directly to others about it. I wrote in-depth articles (on this blog) for the most interesting stories, but below is a short overview of what I focused on in SF:

  • memfd: My most important project currently is memfd. We fixed several bugs and nailed down the API. It was also nice to get feedback from a lot of different projects about interesting use-cases that we didn’t think of initially. As it turns out, file-sealing is something a lot of people can make great use of.
  • MiracleCast: For about half a year I worked on the first open-source implementation of Miracast. It's still under development and only working sink-side, but there are plans to make it work source-side, too. Miracast allows replacing HDMI cables with a wireless solution. You can connect your monitor, TV or projector via standard wifi to your desktop and use it as a mirror or desktop extension. The monitor is the sink side and MiracleCast can already provide a full Miracast stack for it. However, for the more interesting source side (e.g., a Gnome desktop) I had a lot of interesting discussions with Gnome developers about how to integrate it. I have some prototypes running locally, but it will definitely take a lot longer before it works properly. However, the current sink-side implementation has a latency of approx. 50ms and can run 30fps 1080p. This is already pretty impressive and on par with proprietary solutions.
  • kdbus: The new general-purpose IPC mechanism is already fleshed out, but we spent a lot of time fixing races in the code and doing some general code review. It is a very promising project and all of the criticism I've heard so far was rubbish. People tend to rant about moving dbus into the kernel, even though kdbus really has nearly nothing to do with dbus, except that it provides an underlying data-bus infrastructure. Seriously, the helpers used for kernel mode-setting, not counting the driver-specific code, are already much bigger than kdbus… and in my opinion, kdbus will make dbus a lot more efficient and appealing to new developers.
  • GPU: GPU switching, offload GPUs and USB/wifi display controllers are a few of the many new features in the graphics subsystem. They're mostly unsupported in any user-space, so we decided to change that. It's all highly technical and the way it is supposed to work is fairly obvious, so I will avoid discussing the details here. Let's just say, on-demand and live GPU switching is something I'm making possible as part of GSoC this summer.
  • User-bus: This topic sounds fairly boring and technical, but it's not. The underlying question is: what happens if you log in multiple times as the same user on the same system? Currently, a desktop system either rejects multiple logins of the same user or treats them as separate, independent logins. The second approach has the problem that many applications cannot deal with it. Many per-user resources have to be shared (like the home directory). Firefox, for instance, cannot run multiple times for the same user. However, no one wants to prevent multiple logins of the same user, as it really is a nice feature. Therefore, we came up with a hybrid approach which basically boils down to a single session shared across all logins of the same user. So if you log in twice, you get the same screen for both logins, sharing the same applications. The window manager can put you on a separate virtual desktop, but the underlying session is basically the same. Now if you do the same across multiple seats, you simply merge both sessions of these seats into a single huge session with the screen mirrored across all assigned monitors. A more in-depth article will follow once the details have been figured out.

A lot of the things I worked on deal with the low-level system and are hardly visible to the average Gnome user. However, without a proper system API, there’s no Gnome and I’m very happy the Gnome Foundation is acknowledging this by sponsoring my trip to SF: Thanks a lot! And hopefully I’ll see you again next year!


For 4 months now we’ve been hacking on a new syscall for the linux-kernel, called memfd_create. The intention is to provide an easy way to get a file-descriptor for anonymous memory, without requiring a local tmpfs mount-point. The syscall takes 2 parameters, a name and a bunch of flags (which I will not discuss here):

int memfd_create(const char *name, unsigned int flags);

If successful, a new file-descriptor pointing to a freshly allocated memory-backed file is returned. That file is a regular file in a kernel-internal filesystem. Therefore, most filesystem operations are supported, including:

  • ftruncate(2) to change the file size
  • read(2), write(2) and all its derivatives to inspect/modify file contents
  • mmap(2) to get a direct memory-mapping
  • dup(2) to duplicate file-descriptors

Theoretically, you could achieve similar behavior without introducing new syscalls, like this:

int fd = open("/tmp/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
unlink("/tmp/random_file_name");

or this

int fd = shm_open("/random_file_name", O_RDWR | O_CREAT | O_EXCL, S_IRWXU);
shm_unlink("/random_file_name");

or this

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

Therefore, the most important question is: why the hell do we need yet another way?

Two crucial differences are:

  • memfd_create does not require a local mount-point. It can create objects that are not associated with any filesystem and can never be linked into a filesystem. The backing memory is anonymous memory as if malloc(3) had returned a file-descriptor instead of a pointer. Note that even shm_open(3) requires /dev/shm to be a tmpfs-mount. Furthermore, the backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.
  • There are no name-clashes and no global registry. You can create multiple files with the same name and they will all be separate, independent files. Therefore, the name is purely for debugging purposes so it can be detected in task-dumps or the like.

To be honest, the code required for memfd_create is 100 lines. It didn’t take us 2 months to write these, but instead we added one more feature to memfd_create called Sealing:

File-Sealing

File-Sealing is used to prevent a specific set of operations on a file. For example, after you wrote data into a file you can seal it against further writes. Any attempt to write to the file will fail with EPERM. Reading will still be possible, though. The crux of this matter is that seals can never be removed, only added. This guarantees that if a specific seal is set, the information that is protected by that seal is immutable until the object is destroyed.

To retrieve the current set of seals on a file, you use fcntl(2):

int seals = fcntl(fd, F_GET_SEALS);

This returns a signed 32bit integer containing the bitmask of currently set seals on fd. Note that seals are per file, not per file-descriptor (nor per file-description). That means, any file-descriptor for the same underlying inode will share the same seals.

To seal a file, you use fcntl(2) again:

int error = fcntl(fd, F_ADD_SEALS, new_seals);

This takes a bitmask of seals in new_seals and adds these to the current set of seals on fd.

The current set of supported seals is:

  • F_SEAL_SEAL: This seal prevents the seal-operation itself. So once F_SEAL_SEAL is set, any attempt to add new seals via F_ADD_SEALS will fail. Files that don’t support sealing are initially sealed with just this flag. Hence, no other seals can ever be set and thus do not have to be enforced.
  • F_SEAL_WRITE: This is the most straightforward seal. It prevents any content modifications once it is set. Any write(2) call will fail and you cannot get any shared, writable mappings for the file, anymore. Unlike the other seals, you can only set this seal if no shared, writable mappings exist at the time of sealing.
  • F_SEAL_SHRINK: Once set, the file cannot be reduced in size. This means, O_TRUNC, ftruncate(), fallocate(FALLOC_FL_PUNCH_HOLE) and friends will be rejected in case they would shrink the file.
  • F_SEAL_GROW: Once set, the file size cannot be increased. Any write(2) beyond file-boundaries, any ftruncate(2) that increases the file size, and any similar operation that grows the file will be rejected.
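
As a concrete illustration, here is a minimal sketch combining memfd_create with sealing. It assumes a libc and kernel headers that expose memfd_create(2) and the sealing fcntl constants; on older systems you would have to go through syscall(2) and define MFD_ALLOW_SEALING / F_ADD_SEALS yourself.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char msg[] = "hello, sealed world";

	/* create a sealable, memory-backed file; the name is for debugging only */
	int fd = memfd_create("example", MFD_ALLOW_SEALING);
	if (fd < 0) {
		perror("memfd_create");
		return EXIT_FAILURE;
	}

	/* size the file and write the payload */
	if (ftruncate(fd, sizeof(msg)) < 0 ||
	    write(fd, msg, sizeof(msg)) != sizeof(msg)) {
		perror("fill");
		return EXIT_FAILURE;
	}

	/* no shared writable mappings exist, so F_SEAL_WRITE is allowed */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		perror("F_ADD_SEALS");
		return EXIT_FAILURE;
	}

	/* from here on, any write fails with EPERM, but reads still work */
	printf("seals now: 0x%x\n", fcntl(fd, F_GET_SEALS));
	return EXIT_SUCCESS;
}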

Instead of discussing the behavior of each seal on its own, the following list shows some examples of how they can be used. Note that most seals are enforced somewhere low-level in the kernel, instead of directly in the syscall handlers. Therefore, side effects of syscalls I didn't cover here are still accounted for, and the syscalls will fail if they violate any seals.

  • IPC: Imagine you want to pass data between two processes that do not trust each other. That is, there is no hierarchy at all between them and they operate on the same level. The easiest way to achieve this is a pipe, obviously. However, to allow zero-copy (assuming splice(2) is not possible) the processes might decide to use memfd_create to create a shared memory object and pass the file-descriptor to the remote process. Now zero-copy only makes sense if the receiver can parse the data in-line. However, this is not possible in zero-trust scenarios, as the source can retain a file-descriptor and modify the contents while the receiver parses it, causing all kinds of failures. But if the receiver requires the object to be sealed with F_SEAL_WRITE | F_SEAL_SHRINK, it can safely mmap(2) the file and parse it inline. No attacker can alter the file contents anymore. Furthermore, this also allows safe multicasts of the message and all receivers can parse the same zero-copy file without affecting each other. Obviously, the file can never be modified again and is a one-shot object. But this is inherent to zero-trust scenarios. We did implement a recycle-operation in case you're the last user of an object. However, that was dropped due to horrible races in the kernel. It might reoccur in the future, though.
  • Graphics-Servers: This is a very specific use-case of IPC and usually there is a one-way trust relationship from clients to servers. However, a server cannot blindly trust its clients. So imagine a client renders its window-contents into memory and passes a file-descriptor to that memory region (maybe created with memfd_create) to the server. Similar to the previous scenario, the server cannot mmap(2) that object for read-access, as the client might truncate the file simultaneously, causing SIGBUS on the server. A server can protect itself via SIGBUS-handlers, but sealing is a much simpler way. By requiring F_SEAL_SHRINK, the server can be sure the file will never shrink. At the same time, the client can still grow the object in case it needs bigger buffers for growing windows. Furthermore, writing is still allowed so the object can be re-used for the next frame.

As you might imagine, there are a lot more interesting use-cases. However, note that sealing is currently limited to objects created via memfd_create with the MFD_ALLOW_SEALING flag. This is a precaution to make sure we don’t break existing setups. However, changing seals of a file requires WRITE-access, thus it is rather unlikely that sealing would allow attacks that are not already possible with mandatory POSIX locks or similar. Hence, it is possible that sealing will expand to other areas in case people request it. Further seal-types are also possible.

Current Status

As of June 2014 the patches for memfd_create and sealing have been publicly available for at least 2 months and are being considered for merging. linux-3.16 will probably not include it, but linux-3.17 very likely will. Currently, there are still some issues to be figured out regarding AIO and Direct-IO races. But other than that, we're good to go.


Linus decided to have a bit of fun with the 3.16 merge window and the 3.15 release, so I'm a bit late with our regular look at the new stuff for the Intel graphics driver.
First things first, Baytrail/Valleyview has finally gained support for MIPI DSI panels! Which means no more ugly hacks to get machines like the ASUS T100 going for users and no more promises we can't keep from developers - it landed for real this time around. Baytrail has also seen a lot of polish work in e.g. the infoframe handling, power domain reset, ...

Continuing on the new hardware platforms, this release features the first version of our preliminary support for Cherryview. At a very high level this combines a Gen8 render unit derived from Broadwell with a beefed-up Valleyview display block. So a lot of the enabling work boiled down to wiring up existing code, but of course there's also tons of new code to get all the details right. Most of the work has been done by Ville and Chon Ming Lee with lots of help from other people.

Our modeset code has also seen lots of improvements. The user-visible feature is surely support for large cursors. On high-dpi panels 64x64 simply doesn't cut it and the kernel (and latest SNA DDX) now support up to the hardware limit of 256x256. But there have also been a lot of improvements under the hood: more of Ville's infrastructure for atomic pageflips has been merged - slowly all the required pieces like unified plane updates for modeset, two-stage watermark updates or atomic sprite updates are falling into place. Still a lot of work left to do though. And the modesetting infrastructure has also seen a bit of work with the almost complete removal of the ->mode_set hooks. We need that for both atomic modeset updates and for proper runtime PM support.

On that topic: Runtime power management is now enabled for a bunch of our recent platforms - all the prep work from Paulo Zanoni and Imre Deak in the past few releases has finally paid off. There are still leftovers to be picked up over the coming releases, like proper runtime PM support for DPMS on all platforms, addressing a bunch of crazy corner cases, rolling it out on the newer platforms like Cherryview or Broadwell, and cleaning the code up a bit. But overall we're now ready for what the marketing people call "connected standby", which means that power consumption with all devices turned off through runtime PM should be as low as when doing a full system suspend. It crucially relies upon userspace not sucking and waking the cpu and devices up all the time, so personally I'm not sure how well this will work out really.

Another piece for proper atomic pageflip support is the universal primary plane support from Matt Roper. Based upon his DRM core work in 3.15, he has now enabled the universal primary plane support in i915 properly. Unfortunately the corresponding patches for cursor support missed 3.16. The universal plane support is hence still disabled by default. For other atomic modeset work a shout-out goes to Rob Clark, whose locking conversion to wait/wound mutexes for modeset objects has been merged.

On the GEM side Chris Wilson massively improved our OOM handling. We are now much better at surviving a crash against the memory brickwall. And if we don't and indeed run out of memory we have much better data to diagnose the reason for the OOM. The top-down PDE allocator from Ben Widawsky better segregates our usage of the GTT and is one of the pieces required before we can enable full ppgtt for production use. And the command parser from Brad Volkin is required for some OpenGL and OpenCL features on Haswell. The parser itself is fully merged and ready, but the actual batch buffer copying to a secure location missed the merge window and hence it's not yet enabled in permission granting mode.

The big feature to pop the champagne for, though, is the userptr support from Chris - after years I've finally run out of things to complain about and merged it. This allows userspace to wrap up any memory allocations obtained by malloc() (or anything else backed by normal pages) into a GEM buffer object. Useful for faster uploads and downloads in lots of situations, and currently used by the DDX to wrap X shmem segments. But OpenCL also wants to use this.

We've also enabled a few Broadwell features this time around: eDRAM support from Ben, VEBOX2 support from Zhao Yakui and gpu turbo support from Ben and Deepak S.

And finally there's the usual set of improvements and polish all over the place: GPU reset improvements on gen4 from Ville, prep work for DRRS (dynamic refresh rate switching) from Vandana, tons of interrupt and especially vblank handling rework (from Paulo and Ville) and lots of other things.

In Solaris 11.1, I updated the system headers to enable the use of several attributes on functions, including noreturn and printf format, to give compilers and static analyzers more information about how the functions are used, so they can give better warnings when building code.

In Solaris 11.2, I've gone back in and added one more attribute to a number of functions in the system headers: __attribute__((__deprecated__)). This is used to warn people building software that they’re using function calls we recommend no longer be used. While in many cases the Solaris Binary Compatibility Guarantee means we won't ever remove these functions from the system libraries, we still want to discourage their use.
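
For illustration only, such a header annotation looks roughly like the sketch below; the macro name here is made up and is not the actual Solaris header text:

#if defined(__GNUC__) || defined(__SUNPRO_C)
#define WARN_DEPRECATED __attribute__((__deprecated__))
#else
#define WARN_DEPRECATED
#endif

/* callers get a compile-time warning, but the symbol stays in libc */
extern char *gets(char *) WARN_DEPRECATED;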

I made passes through both the POSIX and C standards, and some of the Solaris architecture review cases to come up with an initial list which the Solaris architecture review committee accepted to start with. This set is by no means a complete list of Obsolete function interfaces, but should be a reasonable start at functions that are well documented as deprecated and seem useful to warn developers away from. More functions may be flagged in the future as they get deprecated, or if further passes are made through our existing deprecated functions to flag more of them.

Header | Interface | Deprecated by | Alternative | Documented in
<door.h> | door_cred(3C) | PSARC/2002/188 | door_ucred(3C) | door_cred(3C)
<kvm.h> | kvm_read(3KVM), kvm_write(3KVM) | PSARC/1995/186 | Functions on kvm_kread(3KVM) man page | kvm_read(3KVM)
<stdio.h> | gets(3C) | ISO C99 TC3 (removed in ISO C11), POSIX:2008/XPG7/Unix08 | fgets(3C) | gets(3C) man page, and just about every gets(3C) reference online from the past 25 years, since the Morris worm proved bad things happen when it's used
<unistd.h> | vfork(2) | PSARC/2004/760, POSIX:2001/XPG6/Unix03 (removed in POSIX:2008/XPG7/Unix08) | posix_spawn(3C) | vfork(2) man page
<utmp.h> | All functions from the getutent(3C) man page | PSARC/1999/103 | utmpx functions from the getutentx(3C) man page | getutent(3C) man page
<varargs.h> | varargs.h version of the va_list typedef | ANSI/ISO C89 standard | <stdarg.h> | varargs(3EXT)
<volmgt.h> | All functions | PSARC/2005/672 | hal(5) API | volmgt_check(3VOLMGT), etc.
<sys/nvpair.h> | nvlist_add_boolean(3NVPAIR), nvlist_lookup_boolean(3NVPAIR) | PSARC/2003/587 | nvlist_add_boolean_value, nvlist_lookup_boolean_value | nvlist_add_boolean(3NVPAIR) & (9F), nvlist_lookup_boolean(3NVPAIR) & (9F)
<sys/processor.h> | gethomelgroup(3C) | PSARC/2003/034 | lgrp_home(3LGRP) | gethomelgroup(3C)
<sys/stat_impl.h> | _fxstat, _xstat, _lxstat, _xmknod | PSARC/2009/657 | stat(2) | old functions are undocumented remains of SVR3/COFF compatibility support


To See or Not To See

To see these warnings, you will need to be building with either gcc (versions 3.4, 4.5, 4.7, & 4.8 are available in the 11.2 package repo), or with Oracle Solaris Studio 12.4 or later (which like Solaris 11.2, is currently in beta testing). For instance, take this oversimplified (and obviously buggy) implementation of the cat command:

#include <stdio.h>

int main(int argc, char **argv) {
    char buf[80];

    while (gets(buf) != NULL)
	puts(buf);
    return 0;
}
Compiling it with the Studio 12.4 beta compiler will produce warnings such as:
% cc -V
cc: Sun C 5.13 SunOS_i386 Beta 2014/03/11
% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221

The exact warning given varies by compiler, and the compilers also have a variety of flags to either raise the warnings to errors, or silence them. Of course, the exact form of the output is Not An Interface that can be relied on for automated parsing; it is shown here just as an example.

gets(3C) is actually a special case — as noted above, it is no longer part of the C Standard Library in the C11 standard, so when compiling in C11 mode (i.e. when __STDC_VERSION__ >= 201112L), the <stdio.h> header will not provide a prototype for it, causing the compiler to complain it is unknown:

% gcc -std=c11 gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: implicit declaration of function ‘gets’ [-Wimplicit-function-declaration]
     while (gets(buf) != NULL)
     ^
The gets(3C) function of course is still in libc, so if you ignore the error or provide your own prototype, you can still build code that calls it; you just have to acknowledge that you're taking on the risk of doing so yourself.
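
For instance, to keep such code compiling in C11 mode you would have to declare the prototype yourself (shown purely to illustrate the point, not as a recommendation):

extern char *gets(char *);	/* removed from <stdio.h> in C11, but still present in libc */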

Solaris Studio 12.4 Beta

% cc gets_test.c
"gets_test.c", line 6: warning:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221

% cc -errwarn=E_DEPRECATED_ATT gets_test.c
"gets_test.c", line 6:  "gets" is deprecated, declared in : "/usr/include/iso/stdio_iso.h", line 221
cc: acomp failed for gets_test.c
This warning is silenced in the 12.4 beta by cc -erroff=E_DEPRECATED_ATT.
No warning is currently issued by Studio 12.3 & earlier releases.

gcc 3.4.3

% /usr/sfw/bin/gcc gets_test.c
gets_test.c: In function `main':
gets_test.c:6: warning: `gets' is deprecated (declared at /usr/include/iso/stdio_iso.h:221)

Warning is completely silenced with gcc -Wno-deprecated-declarations

gcc 4.7.3

% /usr/gcc/4.7/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]

% /usr/gcc/4.7/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

gcc 4.8.2

% /usr/bin/gcc gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: warning: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Wdeprecated-declarations]
     while (gets(buf) != NULL)
     ^

% /usr/bin/gcc -Werror=deprecated-declarations gets_test.c
gets_test.c: In function ‘main’:
gets_test.c:6:5: error: ‘gets’ is deprecated (declared at /usr/include/iso/stdio_iso.h:221) [-Werror=deprecated-declarations]
     while (gets(buf) != NULL)
     ^
cc1: some warnings being treated as errors

Warning is completely silenced with gcc -Wno-deprecated-declarations

Global Graphics Translation Tables

Here are the basics of how the GEN GPU interacts with memory. This post will focus on the lowest levels of the i915 driver and the hardware interaction. My hope is that by going through this in excruciating detail, I might be able to take more liberties in future posts.

What is the Global Graphics Translation Table?

The graphics translation table provides the address mapping from the GPU’s virtual address space to a physical address1. The GTT is somewhat of a relic of the AGP days (GART), with the distinction being that the GTT, as it pertains to Intel GEN GPUs, has logic that is contained within the GPU and does not act as a platform IOMMU. I believe (and Wikipedia seems to agree) that GTT and GART were used interchangeably in the AGP days.

GGTT architecture

Each element within the GTT is an entry, usually referred to by the initialism “PTE”, for page table entry. Much of the required initialization is handled by the boot firmware. The i915 driver will get any required information from the initialization process via PCI config space, or MMIO.

Intel/GEN UMA system: example illustrating Intel/GEN memory organization.

Location

The table is located within system memory, and is allocated for us by the BIOS or boot firmware. To clarify the docs a bit, GSM is the portion of stolen memory for the GTT, DSM is the rest of stolen memory used for misc things. DSM is the stolen memory referred to by the current i915 code as “stolen memory.” In theory we can get the location of the GTT from MMIO MPGFXTRK_CR_MBGSM_0_2_0_GTTMMADR (0x108100, 31:20), but we do not do that. The register space, and the GTT entries are both accessible within BAR0 (GTTMMADR).

All the information can be found in Volume 12, p.129: UNCORE_CR_GTTMMADR_0_2_0_PCI. Quoting directly from the HSW spec, “The range requires 4 MB combined for MMIO and Global GTT aperture, with 2MB of that used by MMIO and 2MB used by GTT. GTTADR will begin at GTTMMADR + 2 MB while the MMIO base address will be the same as GTTMMADR.”

In the below code you can see we take the address in the PCI BAR and add half the length to the base. For all modern GENs, this is how things are split in the BAR.

/* For Modern GENs the PTEs and register space are split in the BAR */
gtt_phys_addr = pci_resource_start(dev->pdev, 0) +
	(pci_resource_len(dev->pdev, 0) / 2);

dev_priv->gtt.gsm = ioremap_wc(gtt_phys_addr, gtt_size);

One important thing to notice above is that the PTEs are mapped in a write-combined fashion. Write combining makes sequential updates (something which is very common when mapping objects) significantly faster. Also, the observant reader might ask, ‘why go through the BAR to update the PTEs if we have the actual physical memory location?’ This is the only way we have to make sure the GPU’s TLBs get synchronized properly on PTE updates. If this weren’t required, a nice optimization might be to update all the entries at once with the CPU, and then go tell the GPU to invalidate the TLBs.

Size

Size is a bit more straightforward. We just read the relevant PCI offset. In the docs: p.151 GSA_CR_MGGC0_0_2_0_PCI offset 0x50, bits 9:8

And the code is even more straightforward.

static inline unsigned int gen6_get_total_gtt_size(u16 snb_gmch_ctl)
{
        snb_gmch_ctl >>= SNB_GMCH_GGMS_SHIFT;
        snb_gmch_ctl &= SNB_GMCH_GGMS_MASK;
        return snb_gmch_ctl << 20;
}
pci_read_config_word(dev->pdev, SNB_GMCH_CTRL, &snb_gmch_ctl);
gtt_size = gen6_get_total_gtt_size(snb_gmch_ctl);
gtt_total = (gtt_size / sizeof(gen6_gtt_pte_t)) << PAGE_SHIFT;

Layout

The PTE layout is defined by the PRM and as an example, can be found on page 35 of HSW – Volume 5: Memory Views. For convenience, I have reconstructed the important part here:

Bits  | Field
31:12 | Physical Page Address 31:12
11    | Cacheability Control[3]
10:04 | Physical Page Address 38:32 (see footnote 2)
03:01 | Cacheability Control[2:0]
0     | Valid

The valid bit is always set for all GGTT PTEs. The programming notes tell us to do this (also on page 35 of HSW – Volume 5: Memory Views)3.

Putting it together

As a result of what we’ve just learned, we can sketch a function to write the PTEs:

/**
 * gen_write_pte() - Write a PTE entry
 * @dev_priv:	The driver private structure
 * @address:	The physical address to back the graphics VA
 * @entry:	Which PTE in the table to update
 * @cache_type: Preformatted cache type. Varies by platform
 */
static void
gen_write_pte(struct drm_i915_private *dev_priv, phys_addr_t address,
	      unsigned int entry, uint32_t cache_type)
{
	uint32_t pte;

	/* Total size, divided by the PTE size, gives the number of entries */
	BUG_ON(entry >= gtt_total / 4);
	/* Only address bits 38:0 can be represented in the PTE */
	BUG_ON(address >= (1ULL << 39));

	pte = lower_32_bits(address) |
	      (upper_32_bits(address) << 4) |
	      cache_type |
	      1;
	iowrite32(pte, dev_priv->gtt.gsm + (entry * 4));
}

Example

Let’s analyze a real HSW running something. We can do this with the tool in the intel-gpu-tools suite, intel_gtt, passing it the -d option4.

GTT offset |                 PTEs
--------------------------------------------------------
  0x000000 | 0x0ee23025 0x0ee28025 0x0ee29025 0x0ee2a025
  0x004000 | 0x0ee2b025 0x0ee2c025 0x0ee2d025 0x0ee2e025
  0x008000 | 0x0ee2f025 0x0ee30025 0x0ee31025 0x0ee32025
  0x00c000 | 0x0ee33025 0x0ee34025 0x0ee35025 0x0ee36025
  0x010000 | 0x0ee37025 0x0ee13025 0x0ee1a025 0x0ee1b025
  0x014000 | 0x0ee1c025 0x0ee1d025 0x0ee1e025 0x0ee1f025
  0x018000 | 0x0ee80025 0x0ee81025 0x0ee82025 0x0ee83025
  0x01c000 | 0x0ee84025 0x0ee85025 0x0ee86025 0x0ee87025

And just to continue beating the dead horse, let’s break out the first PTE:

Bits  | Field                       | Value
31:12 | Physical Page Address 31:12 | 0xee23000
11    | Cacheability Control[3]     | 0
10:04 | Physical Page Address 38:32 | 0x2
03:01 | Cacheability Control[2:0]   | 0x2
0     | Valid                       | 1

Physical address: 0x20ee23000
Cache type: 0x2 (WB in LLC Only – Aged "3")
Valid: yes
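
If you want to double-check such a dump yourself, here is a small user-space sketch (mine, not driver code) that decodes one PTE according to the layout above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t pte = 0x0ee23025;	/* first PTE from the dump above */

	/* bits 31:12 are the low address bits, bits 10:4 are address bits 38:32 */
	uint64_t addr = (uint64_t)(pte & 0xfffff000) |
			((uint64_t)((pte >> 4) & 0x7f) << 32);
	/* cacheability control: bits 3:1 are [2:0], bit 11 is [3] */
	unsigned int cache = ((pte >> 1) & 0x7) | (((pte >> 11) & 0x1) << 3);
	unsigned int valid = pte & 0x1;

	printf("physical address: 0x%llx\n", (unsigned long long)addr);
	printf("cache type: 0x%x, valid: %u\n", cache, valid);
	return 0;
}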

Definition of a GEM BO

We refer to virtually contiguous ranges which are mapped to specific graphics operands as objects, buffer objects, BOs, or GEM BOs.

In the i915 driver, the verb “bind” is used to describe the action of making a GPU virtual address range point to the valid backing pages of a buffer object.5 The driver also reuses the verb “pin” from the Linux mm to mean preventing the object from being unbound.

Example of a “bound” GPU buffer

Scratch Page

We’ve already talked about the scratch page twice, albeit briefly. There was an indirect mention, and of course in the image directly above. The scratch page is a single page allocated from memory which every unused GGTT PTE will point to.

To the best of my knowledge, the docs have never given a concrete explanation for the necessity of this, however one might assume unintentional behavior should the GPU take a page fault. One would be right to interject at this point with the fact that, by the very nature of DRI drivers, userspace can almost certainly find a way to hang the GPU. Why should we bother to protect them against this particular issue? Given that the GPU has undefined (read: not part of the behavioral specification) prefetching behavior, we cannot guarantee that even a well-behaved userspace won’t invoke page faults6. Correction: after writing this, I went and looked at the docs. They do explain exactly which engines can, and cannot, take faults. The “why” seems to be missing, however.

Mappings and the aperture

The Aperture

First we need to take a bit of a diversion away from GEN graphics (which to repeat myself, are all of the shared memory type). If one thinks of traditional discrete graphics devices, there is always embedded GPU memory. This poses somewhat of an issue given that all end user applications require the CPU to run. The CPU still dispatches work to the GPU, and for cases like games, the event loop still runs on the CPU. As a result, the CPU needs to be able to both read, and write to memory that the GPU will operate on. There are two common solutions to this problem.
  • DMA engine
    • Setup overhead.
      • Need to deal with asynchronous (and possibly out of order) completion. Latencies involved with both setup and completion notification.
      • Need to actually program the interface via MMIO, or send a command to the GPU7
    • Unlikely to re-arrange or process memory
      • tile/detile surfaces8.
      • can’t take page faults, pages must be pinned
    • No size restrictions (I guess that’s implementation specific)
    • Completely asynchronous – the CPU is free to do whatever else needs doing.
  • Aperture
    • Synchronous. Not only is it slow, but the CPU has to hand hold the data transfer.
    • Size limited/limited resource. There is really no excuse with PCIe and modern 64b platforms why the aperture can’t be as large as needed, but for Intel at least, someone must be making some excuses, because 512MB is as large as it gets for now.
    • Can swizzle as needed (for various tiling formats).
    • Simple usage model. Particularly for unified memory systems.
Moving data via the aperture

Moving data via DMA

The Intel GEN GPUs have no local memory9. However, DMA has very similar properties to writing the backing pages directly on unified memory systems. The aperture is still used for accesses to tiled memory, and for systems without LLC. LLC is out of scope for this post.

GTT and MMAP

There are two distinct interfaces to map an object for reading or writing. There are lots of caveats to the usage of these two methods. My point isn’t to explain how to use them (libdrm is a better way to learn to use them anyway). Rather I wanted to clear up something which confused me early on.

The first is very straightforward, and has behavior I would have expected.

struct drm_i915_gem_mmap {
#define DRM_I915_GEM_MMAP       0x1e
	/** Handle for the object being mapped. */
	__u32 handle;
	__u32 pad;
	/** Offset in the object to map. */
	__u64 offset;
	/**
	 * Length of data to map.
	 *
	 * The value will be page-aligned.
	 */
	__u64 size;
	/**
	 * Returned pointer the data was mapped at.
	 *
	 * This is a fixed-size type for 32/64 compatibility.
	 */
	__u64 addr_ptr;
};

// let bo_handle = some valid GEM BO handle to a 4k object
// What follows is a way to map the BO, and write something
memset(&arg, 0, sizeof(arg));
arg.handle = bo_handle;
arg.offset = 0;
arg.size = 4096;
ioctl(fd, DRM_IOCTL_I915_GEM_MMAP, &arg);
*((uint32_t *)(uintptr_t)arg.addr_ptr) = 0xdefeca7e;

I might be projecting my ineptitude on the reader, but it’s the second interface which caused me a lot of confusion, and the one I’ll talk briefly about. The interface itself is even smaller:

#define DRM_I915_GEM_MMAP_GTT   0x24
struct drm_i915_gem_mmap_gtt {
	/** Handle for the object being mapped. */
	__u32 handle;
	__u32 pad;
	/**
	 * Fake offset to use for subsequent mmap call
	 *
	 * This is a fixed-sizeso [sic] type for 32/64 compatibility.
	 */
	__u64 offset;
};

Why do I think this is confusing? The name itself never quite made sense – what use is there in mapping an object to the GTT? Furthermore, how does mapping it to the GPU allow me to do anything with it from userspace? For one thing, I had confused “mmap” with “map.” The former really does identify the recipient (the CPU, not the GPU) of the mapping. It follows the conventional use of mmap(). The other thing is that the interface has an implicit meaning. A GTT map here actually means a GTT mapping within the aperture space. Recall that the aperture is a subset of the GTT which can be accessed through a PCI BAR. Therefore, what this interface actually does is return a token to userspace which can be mmap’d to get the CPU mapping (through the BAR, to the GPU memory). Like I said before, there are a lot of caveats to the decision to use one vs. the other, which depends on platform, the type of surface you are operating on, and available aperture space at the time of the call. All of these things will not be discussed.

Conceptualized view of mmap and mmap_gtt

Finally, here is a snippet of code from intel-gpu-tools that hopefully just encapsulates what I said and drew.

mmap_arg.handle = handle;
assert(drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP_GTT, &mmap_arg) == 0);
assert(mmap64(0, OBJECT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_arg.offset));

Summary

This is how modern Intel GPUs deal with system memory on all platforms without a PPGTT (or if you disable it via module parameter). Although I happily skipped over the parts about tiling, fences, and cache coherency, rest assured that if you understood all of this post, you have a good footing. Going over the HSW docs again for this post, I am really pleased with how much Intel has improved the organization, and clarity. I highly encourage you to go off and read those for any missing pieces.

Please let me know about any bugs, or feature requests in this post. I would be happy to add them as time allows.

Here are links to SVGs of all the images I created. Feel free to use them how you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/06/overview_standard.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/bo_mapped.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/dma_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/aper_example.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/06/mmaps.svg


  1. when using VT-d, the address is actually an I/O address rather than the physical address

  2. Previous gens went to 39 

  3. I have submitted two patch series, one of which has been reverted, the other, never merged, which allow invalid PTEs for debug purposes 

  4. intel_gtt is currently not supported for GEN8+. If someone wants to volunteer to update this tool for gen8, please let me know 

  5. I’ve fought to call this operation, “map” 

  6. Empirically (for me), GEN7+ GPUs have behaved themselves quite well after taking the page fault. I very much believe we should be using this feature as much as possible to help userspace driver developers 

  7. I’ve previously written a post on how this works for Intel 

  8. Sorry people, this one is too far out of scope for an explanation in this post. Just trust that it’s a limitation if you don’t understand. Daniel Vetter probably wrote an article about it if you feel like heading over to his blog

  9. There are several distinct caches on all modern GEN GPUs, as well as eDRAM for Intel’s Iris Pro. The combined amount of this “local” memory is actually greater than many earlier discrete GPUs 

June 05, 2014

I don’t know if I’ve ever eaten my own dogfood that smells this risky.

A few days ago, I published patches to support dynamic page table allocation and tear-down in the i915 driver http://lists.freedesktop.org/archives/intel-gfx/2014-March/041814.html. This work will eventually help us support expanded page tables (similar to how things work for normal Linux page tables). The patches rely on using full PPGTT support, which still requires some work to get enabled by default. As a result, I’ll be carrying around this work for quite a while. The patches provide a lot of opportunity to uncover all sorts of weird bugs we’ve never seen due to the more stressful usage of the GPU’s TLBs. To avoid the patches getting too stale, and to further the bug extermination, I’ve figured, why not run it myself?

If you feel like some serious pain, or just want to help me debug it, give it a go – there should be absolutely no visible gain for you, only harm. You can either grab the patches from the mailing list, patchwork, or my branch.  Make sure to turn on full PPGTT support with i915.enable_ppgtt=2. If you do decide to opt for the pain, you can take comfort in the fact that you’re helping get the next big piece of prep work in place.

The question is, how long before I get sick of this terrible dogfood? I’m thinking by Monday I’ll be finished :D

This is a short and vague glimpse of the interfaces that the Linux kernel offers to user space for display and graphics management, from the history to what is hot and new, to what might perhaps be coming after. The topic became relevant to me when I started preparing Weston for global thermonuclear war.

The pre-history


In the age of dragons, kernel mode setting did not exist. There was only user space mode setting, where the job of the kernel driver (if any) was simply to give user space direct access to the graphics card registers. A user space driver (well, Xorg video DDX, really, err... or what it was at the time of XFree86) would then poke the card registers to set a mode. The kernel had no idea of anything.

The kernel DRM infrastructure was started as an out-of-tree kernel module for cooperating between multiple programs wanting to access the graphics card's resources. Later it was (partially?) merged into the kernel tree (the year is a lie, 2.3.18 came out in 1999), and much much later it was finally deleted from the libdrm repository.

The middle age


For some time, the kernel DRM existed alongside user space mode setting. It was a dark time full of crazy hacks to keep it all together with duct tape, barbwire and luck. GPUs and hardware accelerated OpenGL started to come up.

The new age


With the advent of kernel mode setting (KMS), the DRM kernel drivers got in charge of the graphics card resources: outputs, video modes, memory allocations, hotplug! User space mode setting became obsolete and was eventually killed. The kernel driver was finally actually in control of the graphics hardware.

KMS probably started with just setting the main framebuffer (primary plane) for each "CRTC" and programming the video mode. A CRTC is for "cathode-ray tube controller", but essentially means a block that reads memory (a framebuffer) and produces a bitstream according to video mode timings. The bitstream is directed into an "encoder", which turns it into a proper physical/analogue signal, like VGA or digital DVI. The signal then exits the graphics card though a "connector". CRTC, encoder, and connector are the basic concepts in KMS API. Quite often these can be combined in some restricted ways, like a single CRTC feeding two encoders for clone mode.
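
To see these objects from user space, a minimal libdrm sketch that enumerates them could look like the following; the device node path is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int main(void)
{
	int fd = open("/dev/dri/card0", O_RDWR);
	drmModeResPtr res = drmModeGetResources(fd);
	if (!res)
		return 1;

	printf("%d connectors, %d encoders, %d CRTCs\n",
	       res->count_connectors, res->count_encoders, res->count_crtcs);

	for (int i = 0; i < res->count_connectors; i++) {
		drmModeConnectorPtr c = drmModeGetConnector(fd, res->connectors[i]);
		if (!c)
			continue;
		printf("connector %u: %sconnected, %d modes\n",
		       c->connector_id,
		       c->connection == DRM_MODE_CONNECTED ? "" : "dis",
		       c->count_modes);
		drmModeFreeConnector(c);
	}
	drmModeFreeResources(res);
	return 0;
}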

Even ancient hardware supported hardware cursors: a small sprite that was composited into the outgoing video signal on the fly, which meant that it was very cheap to move around. Cursor being so special, and often with funny color format (alpha!), got its very own DRM ioctl.

There were also hardware overlays (additional or secondary planes) on some hardware. While the primary framebuffer covers the whole display, an overlay is another buffer (just like the cursor) that gets mixed into the bitstream at the CRTC level. It is like basic compositing done on the scanout hardware level. Overlays usually had additional benefits, for example they could apply scaling or color space conversion (hello, video players) very efficiently. Overlays being different, they too got their very own DRM ioctls.

The KMS user space ABI was anything but atomic. With the X11 tradition, it wasn't too important how to update the displays, as long as the end result eventually was what you wanted. Race conditions in content updates didn't matter too much either, as X was racy as hell anyway. You update the CRTC. Then you update each overlay. You might update the cursor, too. By luck, all these updates could hit the same vblank. Or not. Or you don't hit vblank at all, and get tearing. No big deal, as X was essentially all about front-buffer rendering anyway. (And then there were huge efforts in trying to fix it all up with X, GLX, Mesa and GL-compositors, and avoid tearing, and it ended up complicated.)

With the advent of X compositing managers, that did not play well with the awkward X11 protocol (Xv) or the hardware overlays, and with the rise of GPU power and OpenGL, it was thought that hardware overlays would eventually die out. Turned out the benefits of hardware overlays were too great to abandon, and with Wayland we again have a decent chance to make the most of them while still enjoying compositing.

The global thermonuclear war (named after a git branch by Rob Clark)


The quality of display updates became important. People do not like tearing. Someone actually wanted to update the primary framebuffer and the overlays on the same vblank, guaranteed. And the cursor as the cherry on top.

We needed one ABI to rule them all.

Universal planes brings framebuffers (primary planes), overlays (secondary planes) and cursors (cursor planes) together under the same API. No more type specific ioctls, but common ioctls shared by them all. As these objects are still somewhat different, overlays having wildly differing features and vendors wanting to expose their own stuff, object properties were invented.

An object property is essentially a {key, value} pair. In the API, the name of a key is a string. Each object has its own set of keys. To use a key, you must know it by name, fetch the handle, and then use the handle when setting the value. Handles seem to be per-object, so make sure to fetch them separately for each.

Atomic mode setting and nuclear pageflip are two sides of the same feature. Atomicity is achieved by gathering a set of property changes, and then pushing them all into the kernel in a single ioctl call. Then that call either succeeds or fails as a whole. Libdrm offers a drmModePropertySet for gathering the changes. Everything is exposed as properties: the attached FB, overlay position, video mode, etc.
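
For illustration, here is a hedged sketch of such an atomic commit using the drmModeAtomicReq API that eventually landed in libdrm (the drmModePropertySet proposal mentioned above evolved into it). The object and property IDs are placeholders that you would normally discover by name via drmModeObjectGetProperties():

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int flip_atomically(int fd, uint32_t plane_id, uint32_t fb_prop_id,
		    uint32_t new_fb_id)
{
	int ret;
	drmModeAtomicReqPtr req = drmModeAtomicAlloc();
	if (!req)
		return -1;

	/* queue one or more {object, property, value} changes... */
	drmModeAtomicAddProperty(req, plane_id, fb_prop_id, new_fb_id);

	/* ...and push them all in one ioctl: it succeeds or fails as a whole */
	ret = drmModeAtomicCommit(fd, req, DRM_MODE_PAGE_FLIP_EVENT, NULL);
	drmModeAtomicFree(req);
	return ret;
}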

Atomic mode setting means setting the output modes of a single graphics device, more or less. Devices may have hard-to-express limitations. A simple example is the available scanout memory bandwidth: you can drive either two mid-resolution outputs, or one high-resolution output. Or maybe some crtc-encoder-connector combination is not possible with a particular other combination for another output. Collecting the video mode, encoder and connector setup over the whole graphics card into a single operation avoids flicker. Either the whole set succeeds, or it fails. Without atomic mode setting, changing multiple outputs would not only take longer, but if some step failed, you'd have to undo all earlier steps (and hope the undo steps don't fail). Plus, there would be no way to easily test if a certain combination is possible. Atomic mode setting fixes all this.

Nuclear pageflip is about synchronizing the update of a single output (monitor) and making that atomic. This means that when user space wants to update the primary framebuffer, move the cursor, and update a couple of overlays, all those changes happen at the same vblank. Again it all either succeeds or fails. "Every frame is perfect."

And then there shall be ponies (at the end of the rainbow)


Once the global thermonuclear war is over, we have the perfect ABI for driving display updates.

Well, almost. Enter NVidia G-Sync, or AMD's FreeSync which is actually backed by a VESA standard. Dynamically variable refresh rate. We have no way yet for timing display updates in DRM. All we can do is kick out a display update, and it will hopefully land on the next vblank, whenever that is. But we can't tell the DRM when we would like it to be. Everything so far assumes that the display refresh rate is a constant, apart from an explicit mode switch. Though I have heard that e.g. Chrome for Intel (i915, LVDS/eDP reclocking) has some hacks that opportunistically drop the refresh rate to save power.

There is also a culprit in the DRM of today (Jun 3rd, 2014). You can schedule a pageflip, but if you have pending rendering on that framebuffer on the same GPU as where you are presenting it, the pageflip will not happen until the rendering completes. And you do not know when it will complete, which means you do not know if you will hit the very next vblank or something later.

If the rendering GPU is not the same graphics device that presents the framebuffer, you do not get synchronization at all. That means that you may be scanning out an incomplete rendering for a frame or two, or you have to stall the GPU to make sure it is done before scheduling the page flip. This should be fixed with the fences related to dma-bufs (Hi, Maarten Lankhorst).

And so the unicorn keeps on running.
May 30, 2014

Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up. I've been there mainly to discuss Ceilometer upcoming developments.

The summit has been great. It was my third OpenStack design summit, and the first one where I was not a PTL, meaning it was a far more relaxed summit for me!

On Monday, we started with a 2.5 hour meeting with Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week.

Ceilometer had its design sessions running mainly on Wednesday. We noted and commented on a lot of things during the sessions in our Etherpad instances. Here is a short summary of the sessions I attended.

Scaling the central agent

I was in charge of the first session, and introduced the work that has been done so far on scaling the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the tooz library that we worked on at eNovance during the last 6 months.

Now that we have this foundation available, Cyril Roelandt started to replace the Ceilometer alarming job repartition code with Taskflow and Tooz. Starting with the alarm evaluators is simpler and will serve as a first proof of concept to be reused by the central agent afterwards. We plan to get this merged for Juno.

For the central agent, the same work needs to be done, but since it's a bit more complicated, it will be done after the alarming evaluators are converted.

Test strategy

The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot to be done in this area, and this is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we're still not there yet.

Complex queries and per-user/project data collection

This session, led by Ildikó Váncsa, was about adding finer-grained configuration into the pipeline configuration to allow per-user and per-project data retrieval. This was not really controversial, though how to implement this exactly is still to be discussed, but the idea was well received. The other part of the session was about adding more in the complex queries feature provided by the v2 API.

Rethinking Ceilometer as a Time-Series-as-a-Service

This was my main session, the reason we met on Monday for a few hours, and one of the most promising session – I hope – of the week.

It appears that the way Ceilometer designed its API and storage backends a long time ago is now a problem for scaling the data storage. Also, the events API we introduced in the last release partially overlaps some of the functionality provided by the samples API, which causes us scaling troubles.

Therefore, I've started to rethink the Ceilometer API by building it as a time series read/write service, leaving the audit part of our previous sample API to the event subsystem. After some research and experimentation, I've designed a new project called Gnocchi, which provides exactly that functionality in a hopefully scalable way.

Gnocchi is split in two parts: a time series API and its driver, and a resource indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time series handling is based on Pandas and Swift. The canonical resource indexer driver is based on SQLAlchemy.

The idea and project was well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.

Revisiting the Ceilometer data model

This session led by Alexei Kornienko, kind of echoed the previous session, as it clearly also tried to address the Ceilometer scalability issue, but in a different way.

Anyway, the SQL driver limitations have been discussed and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.

Ceilometer devops session

We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting, and the list of things we could improve is long; I think it will help us drive our future efforts.

SNMP inspectors

This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.

Alarm and logs improvements

This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements on the alarm evaluation system provided by Ceilometer, and making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.

Conclusion

Considering the current QA problems with Ceilometer, Eoghan Glynn, the new Project Technical Leader for Ceilometer, clearly indicated that this will be the main focus of the release cycle.

Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the next weeks. Our idea is to develop a complete solution with high velocity in the next weeks, and then work on its integration with Ceilometer itself.