planet.freedesktop.org
July 16, 2018

Add infrastructure for Vblank and page flip events in vkms simulated by hrtimer


Since the beginning of May 2018, I have been diving into the DRM subsystem. In the beginning, nothing made sense to me, and I had to fight hard to understand how things work. Fortunately, I was not alone, and I had great support from Gustavo Padovan, Daniel Vetter, Haneen Mohammed, and the entire community. Recently, I finally delivered a new feature for VKMS: the infrastructure for Vblank and page flip events.

At this moment, VKMS has regular Vblank events simulated through hrtimers (see drm-misc-next), which is a feature required by VKMS to mimic real hardware [6]. The development approach was entirely driven by the tests provided by IGT, more specifically kms_flip. I modified IGT to read a module name from the command line and force its use, instead of using only the modules defined in the code (patch submitted to IGT, see [1]). With this modification in IGT, my development process to add a Vblank infrastructure to VKMS had three main steps, as Figure 1 describes.

Figure 1: My work cycle in VKMS

Firstly, I focused only on the subtest “basic-plain-flip” from IGT, and after each execution of the test I checked the failure messages. Secondly, I tried to write the code required to make the test pass; it is essential to highlight that this phase sometimes took me days to understand the problem and implement the fix. Finally, after I overcame the failure, I put in additional effort to improve the implementation. As can be seen in the patchset sent to add the Vblank support [2], the first set of patches was not directly related to Vblank itself, but it was necessary infrastructure required for kms_flip to work.

Figure 2: sudo ./tests/kms_flip --run-subtest basic-plain-flip --force-module vkms

After an extended period of work to make VKMS pass basic-plain-flip, I finally achieved it thanks to all the support I received from the DRM community. Next, I started to work on the subtest “wf_vblank-ts-check”, and here I spent a lot of time debugging problems. The test behaved stochastically: sometimes it passed and other times it failed, and I supposed the problem was related to the accumulation of errors during the page flip step. As a result, I put considerable effort into making the page flip timer precise, and I ended up with a patch that calculates the exact moment of the next period (see [5]). Nevertheless, after I submitted the patch, Chris Wilson highlighted that I was reinventing the wheel, since hrtimer already does the required calculations [4]; he was 100% right. After his comment I went through hrtimer_forward line by line and concluded that I had implemented the same algorithm. I lost some days recreating something that was not useful in the end; however, it was really valuable for me since I learned how hrtimer works and also expanded my comprehension of Vblank. Finally, Daniel Vetter precisely pointed out a series of problems in the patch (see [3]), and fixing them not only improved the tests but also made most of the tests in kms_flip pass.
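For readers curious what an hrtimer-based Vblank simulation looks like in practice, here is a rough, simplified sketch of the idea. Function, struct and field names are illustrative and not the exact VKMS code; only the kernel APIs (hrtimer_forward_now, drm_crtc_handle_vblank) are real.

/* Illustrative per-output state; not the real VKMS structures. */
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <drm/drm_crtc.h>
#include <drm/drm_vblank.h>

struct my_output {
        struct drm_crtc crtc;
        struct hrtimer vblank_hrtimer;
        ktime_t period_ns;
        unsigned int refresh_rate;   /* e.g. 60 */
};

static enum hrtimer_restart vblank_simulate(struct hrtimer *timer)
{
        struct my_output *output = container_of(timer, struct my_output, vblank_hrtimer);

        /* Signal the vblank to DRM, which also delivers pending page flip events. */
        drm_crtc_handle_vblank(&output->crtc);

        /* Advance the timer by a whole number of periods past 'now';
         * hrtimer_forward_now() does the calculation I had reimplemented by hand. */
        hrtimer_forward_now(timer, output->period_ns);

        return HRTIMER_RESTART;
}

/* Enabling vblank arms the timer with the period derived from the refresh rate. */
static int enable_vblank(struct my_output *output)
{
        hrtimer_init(&output->vblank_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        output->vblank_hrtimer.function = vblank_simulate;
        output->period_ns = ktime_set(0, NSEC_PER_SEC / output->refresh_rate);
        hrtimer_start(&output->vblank_hrtimer, output->period_ns, HRTIMER_MODE_REL);
        return 0;
}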

In conclusion, adding the infrastructure for Vblank and page flip events in VKMS was an exciting feature to deliver, and it was also an important task that taught me how things work in DRM. I am still focused on this part of VKMS, but now I am starting to think about how to add virtual hardware which does not support the Vblank interrupt. Finally, I want to write a detailed blog post on how I implemented the Vblank support in VKMS and another post about timers (in user and kernel space); I believe this sort of post could be helpful for someone who is just starting in the DRM subsystem.

Thanks to the whole DRM community, which is always kind and provides great help to a newcomer like me :)

References

  1. Force option in IGT
  2. Adding infrastructure for Vblank and page flip events in vkms
  3. Daniel Vetter comments in V2
  4. Chris Wilson comments about hrtimer
  5. Calculating the period
  6. DRM misc-next
July 09, 2018

Now that my V3D GPU hangs are cleared up, I made a lot of progress on conformance:

Fixes from last week:

  • Don’t reset the GPU until it stops making progress on the job (instead of just a fixed timeout).
  • Fixed noperspective interpolation
  • Fixed leaks of default attributes and spill BOs.
  • Fixed ARB_color_buffer_float clamping of gl_FragData[]
  • Fixed support for GL_EXT_draw_buffers2’s independent blend enables.
  • Fixed GL_SAMPLE_ALPHA_TO_ONE.

Fail/(Pass+Fail) ratios after my weekend run:

  • EGL: 96/2143
  • GLES2: 7/16985
  • GLES3: 25/42549
  • GLES3 multisample: crash!

I’ve debugged the first MSAA crash, but there are some MSAA resolve bugs after that which I need to sort out (some of which are probably the cause of those EGL failures).

On the VC4 front, Boris landed the writeback changes, and I submitted the DT for it in my ARM pull request.

July 08, 2018

A common error when building from source is something like the error below:


meson.build:50:0: ERROR: Native dependency 'foo' not found
or a similar error:

meson.build:63:0: ERROR: Invalid version of dependency, need 'foo' ['>= 1.1.0'] found '1.0.0'.
Seeing that can be quite discouraging, but luckily, in many cases it's not too difficult to fix. As usual, there are many ways to get to a successful result; I'll describe what I consider the simplest.

What does it mean? Dependencies are simply libraries or tools that meson needs to build the project. Usually these are declared like this in meson.build:


dep_foo = dependency('foo', version: '>= 1.1.0')
In human words: "we need the development headers for library foo (or 'libfoo') of version 1.1.0 or later". meson uses the pkg-config tool in the background to resolve that request. If we require package foo, pkg-config searches for a file foo.pc in the following directories:
  • /usr/lib/pkgconfig,
  • /usr/lib64/pkgconfig,
  • /usr/share/pkgconfig,
  • /usr/local/lib/pkgconfig,
  • /usr/local/share/pkgconfig
The error message simply means pkg-config couldn't find the file and you need to install the matching package from your distribution or from source.
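If you want to check for yourself whether pkg-config can see the package at all, you can ask it directly (assuming pkg-config itself is installed):

$> pkg-config --modversion foo

If this prints a version number, the .pc file was found; if it complains that the package could not be found, you need to install it as described below.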

An important note here: in most cases we need the development headers of said library; installing just the library itself is not sufficient. After all, we're trying to build against it, not merely run against it.

What package provides the foo.pc file?

In many cases the package is the development version of the package name. Try foo-devel (Fedora, RHEL, SuSE, ...) or foo-dev (Debian, Ubuntu, ...). yum and dnf provide a great shortcut to install any pkg-config dependency:


$> dnf install "pkgconfig(foo)"

$> yum install "pkgconfig(foo)"
will automatically search and install the right package, including its dependencies.
apt-get requires a bit more effort:

$> apt-get install apt-file
$> apt-file update
$> apt-file search --package-only foo.pc
foo-dev
$> apt-get install foo-dev
For those running Arch and pacman, the sequence is:

$> pacman -S pkgfile
$> pkgfile -u
$> pkgfile foo.pc
extra/foo
$> pacman -S extra/foo
Once that's done you can re-run meson and see if all dependencies have been met. If more packages are missing, follow the same process for the next file.

Any users of other distributions - let me know how to do this on yours and I'll update the post.

My version is wrong!

It's not uncommon to see the following error after installing the right package:


meson.build:63:0: ERROR: Invalid version of dependency, need 'foo' ['>= 1.1.0'] found '1.0.0'.
Now you're stuck and you have a problem. What this means is that the package version your distribution provides is not new enough to build your software. This is where the simple solutions end and it all gets a bit more complicated - with more potential errors. Unless you are willing to go into the deep end, I recommend moving on and accepting that you can't have the newest bits on an older distribution. Because now you have to build the dependencies from source, which may then require building their dependencies from source, and before you know it you've built 30 packages. If you're willing, read on; otherwise - sorry, you won't be able to run your software today.

Manually installing dependencies

Now you're in the deep end, so be aware that you may see more complicated errors in the process. First of all you need to figure out where to get the source from. I'll now use cairo as an example instead of foo so you see actual data. On rpm-based distributions like Fedora run dnf or yum:


$> dnf info cairo-devel # or yum info cairo-devel
Loaded plugins: auto-update-debuginfo, langpacks
Installed Packages
Name : cairo-devel
Arch : x86_64
Version : 1.13.1
Release : 0.1.git337ab1f.fc20
Size : 2.4 M
Repo : installed
From repo : fedora
Summary : Development files for cairo
URL : http://cairographics.org
License : LGPLv2 or MPLv1.1
Description : Cairo is a 2D graphics library designed to provide high-quality
: display and print output.
:
: This package contains libraries, header files and developer
: documentation needed for developing software which uses the cairo
: graphics library.
The important field here is the URL line - go to that URL and you'll find the source tarballs. That should be true for most projects, but you may need to google for the package name and hope. Search for the tarball with the right version number and download it. On Debian and related distributions, cairo is provided by the libcairo2-dev package. Run apt-cache show on that package:

$> apt-cache show libcairo2-dev
Package: libcairo2-dev
Source: cairo
Version: 1.12.2-3
Installed-Size: 2766
Maintainer: Dave Beckett
Architecture: amd64
Provides: libcairo-dev
Depends: libcairo2 (= 1.12.2-3), libcairo-gobject2 (= 1.12.2-3),[...]
Suggests: libcairo2-doc
Description-en: Development files for the Cairo 2D graphics library
Cairo is a multi-platform library providing anti-aliased
vector-based rendering for multiple target backends.
.
This package contains the development libraries, header files needed by
programs that want to compile with Cairo.
Homepage: http://cairographics.org/
Description-md5: 07fe86d11452aa2efc887db335b46f58
Tag: devel::library, role::devel-lib, uitoolkit::gtk
Section: libdevel
Priority: optional
Filename: pool/main/c/cairo/libcairo2-dev_1.12.2-3_amd64.deb
Size: 1160286
MD5sum: e29852ae8e8e5510b00b13dbc201ce66
SHA1: 2ed3534d02c01b8d10b13748c3a02820d10962cf
SHA256: a6099cfbcc6bd891e347dd9abc57b7f137e0fd619deaff39606fd58f0cc60d27
In this case it's the Homepage line that matters, but the process of downloading tarballs is the same as above. For Arch users, the interesting line is URL as well:

$> pacman -Si cairo | grep URL
Repository : extra
Name : cairo
Version : 1.12.16-1
Description : Cairo vector graphics library
Architecture : x86_64
URL : http://cairographics.org/
Licenses : LGPL MPL
....

Now to the complicated bit: in most cases, you shouldn't install the new version over the system version because you may break other things. You're better off installing the dependency into a custom folder ("prefix") and pointing pkg-config to it. So let's say you downloaded the cairo tarball, now you need to run:


$> mkdir $HOME/dependencies/
$> tar xf cairo-someversion.tar.xz
$> cd cairo-someversion
$> autoreconf -ivf
$> ./configure --prefix=$HOME/dependencies
$> make && make install
$> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
# now go back to original project and run meson again
So you create a directory called dependencies and install cairo there. This will install cairo.pc as $HOME/dependencies/lib/pkgconfig/cairo.pc. Now all you need to do is tell pkg-config that you want it to look there as well - so you set PKG_CONFIG_PATH. If you re-run meson in the original project, pkg-config will find the new version and meson should succeed. If you have multiple packages that all require a newer version, install them into the same path and you only need to set PKG_CONFIG_PATH once. Remember you need to set PKG_CONFIG_PATH in the same shell as you are running configure from.

In the case of dependencies that use meson, you replace autotools and make with meson and ninja:


$> mkdir $HOME/dependencies/
$> tar xf foo-someversion.tar.xz
$> cd foo-someversion
$> meson builddir -Dprefix=$HOME/dependencies
$> ninja -C builddir install
$> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
# now go back to original project and run meson again

If you keep seeing the version error the most common problem is that PKG_CONFIG_PATH isn't set in your shell, or doesn't point to the new cairo.pc file. A simple way to check is:


$> pkg-config --modversion cairo
1.13.1
Is the version number the one you installed or the system one? If it is the system one, you have a typo in PKG_CONFIG_PATH, just re-set it. If it still doesn't work do this:

$> cat $HOME/dependencies/lib/pkgconfig/cairo.pc
prefix=/usr
exec_prefix=/usr
libdir=/usr/lib64
includedir=/usr/include

Name: cairo
Description: Multi-platform 2D graphics library
Version: 1.13.1

Requires.private: gobject-2.0 glib-2.0 >= 2.14 [...]
Libs: -L${libdir} -lcairo
Libs.private: -lz -lz -lGL
Cflags: -I${includedir}/cairo
If the Version field matches what pkg-config returns, then you're set. If not, keep adjusting PKG_CONFIG_PATH until it works. There is a rare case where the Version field in the installed library doesn't match what the tarball said. That's a defective tarball and you should report this to the project, but don't worry, this hardly ever happens. In almost all cases, the cause is simply PKG_CONFIG_PATH not being set correctly. Keep trying :)

Let's assume you've managed to build the dependencies and want to run the newly built project. The only problem is: because you built against a newer library than the one on your system, you need to point it to use the new libraries.


$> export LD_LIBRARY_PATH=$HOME/dependencies/lib
and now you can, in the same shell, run your project.

Good luck!

July 02, 2018

For V3D, last week I spent mostly building up .CLIF dumping support so that I could talk to the HW team about the GPU hangs I’ve been experiencing. This involved reformatting my dump output, naming some missing (0-filled) fields of the GPU packets I emit, and restructuring the dumper so that I could dump the contents of each BO all at once (whereas before my dumps didn’t really care about BOs and would just dump each memory area as it was found in the worklist).

The final result was a CLIF file I sent to the HW team of a simple transform feedback test that was locking up the GPU. They reported that I was missing a wait packet that was required, beyond what the HW specs said. (I had actually earlier tried emitting the wait, but typoed the condition for enabling it!) With that, I think I’ve cleared up all of the GPU hangs in piglit runs of khr_gles3.

On the VC4 front, the DSI panel enable/disable sequencing patch is in, which should enable other people to successfully build DSI panel drivers for Raspberry Pi. Boris has resubmitted the VC4 writeback (TXP) series now that the core writeback support is in, and there were just a couple of cleanups and it should now be ready to land.

June 26, 2018

This post is part of a series: Part 1, Part 2, Part 3, Part 4, Part 5.

In this post I'll describe the X server pointer acceleration for trackpoints. You will need to read Observations on trackpoint input data first to make sense of this post.

As described in that linked post, trackpoint input data varies wildly. Combined with the options we have in the server to configure everything, this makes the post a bit pointless, as almost every single behaviour can be changed.

The linked post also describes the three subjective pressure ranges: no real physical pressure, some physical pressure, and serious pressure. The line between the first two ranges is roughly where the trackpoint sends deltas at the maximum reporting rate (100Hz) but with a value of 1. Below that pressure, the intervals increase but the delta remains at 1. Above that pressure, the interval remains constant at 10ms but the deltas increase. I've used the default kernel trackpoint sensitivity of 128 for any data listed here. Here is the visualisation of how deltas and intervals change again.

The default pointer acceleration profile in the X server is the simple profile. We know this from the earlier posts, it has a double-plateau shape. On a trackpoint mm/s doesn't make sense here so let's look at it in units/ms instead. A unit is simply a device-specific measurement of distance/pressure/tilt/whatever - it all depends on the device. On trackpoints that is (mostly) sideways pressure or tilt. On mice and touchpads we can convert units to mm based on their resolution. On trackpoints, we don't have a physical reference and we thus have to deal with it in units. The obvious problem here is that 1 unit on one device does not equal 1 unit on another device. And for configurable trackpoints, the definition of a unit changes as the sensitivity changes. And that's after the kernel already mangles it (if it does, it doesn't for all devices). So here's a box of asterisks, please sprinkle it liberally.

The smallest delta the kernel can send is 1. At a hardware report rate of 100Hz, continuous pressure to the smallest detected threshold thus generates 1 unit every 10 milliseconds or 0.1 units/ms. If I push uncomfortably hard, I can get deltas of around 10 units every 10ms or 1 unit/ms. In other words, we better zoom in here. Let's look at the meaningful range of this curve.

On my trackpoint, below 0.1 units/ms means virtually no pressure (pressure range one). Pressure range two is 0.1 to 0.4, approximately. Beyond that is pressure range three but that is also the range that becomes pointless quickly - I simply wouldn't want to press this hard in normal operation. 1 unit per ms (10 units per report) is very high pressure. This means the pointer acceleration curve is actually defined for the usable range with only outliers hitting the maximum acceleration. For mice this curve was effectively a constant acceleration for all but slow movements (see here). However, any configuration can change this curve to a point where none of the above applies.

Back to the minimum constant movement of 0.1 units/ms. That one effectively matches the start of the 'no accel' plateau. Anything below that will be decelerated, i.e. a delta of 1 unit will result in a pointer delta of less than 1 pixel. In other words, anything up to where you have to apply real pressure is decelerated.

The constant factor plateau goes all the way to 0.4 units/ms. Then there's the buggy jump to a factor of ~1.5, followed by a smooth curve to 0.8 units/ms where the factor maxes out. A bit of testing here suggests that 0.4 units/ms is in the upper limits of the second pressure range mentioned above. Going past 0.6 or 0.7 is definitely well within the third pressure range where things get uncomfortable quickly. This means that the acceleration bug is actually sitting right in the highest interesting range. Apparently no-one has noticed for 10 years.

But what does it matter? Well, probably not even that much. The only interesting bit I can see here is that we have deceleration for most low-pressure movements and a constant acceleration of 1 for most realistic movements. I very much doubt that the range above 0.4 really matters.

But hey, this is just the default configuration. It is affected when someone changes the speed slider in GNOME, or when someone changes the sensitivity at the sysfs level. Other trackpoints won't have the exact same behaviour. Any analysis is thrown out of the window as soon as someone changes the sysfs sensitivity or increases the acceleration threshold.

Let's talk sysfs - if we increase my trackpoint sensitivity to 200, the deltas coming from the trackpoint change. First, the pressure required to give me a constant stream of events often gives me deltas of size 2 or 3. So we're half-way into the no acceleration plateau here. Higher pressures easily give me deltas of size 10 or 1 unit per ms, the edge of the image above.

I wish I could analyse this any further but realistically, the only takeaway here is that any change in configuration options results in some version of trial-and-error by the user until the trackpoint moves as they want to. But without knowing all those options, we just cannot know what exactly is happening.

However, what this is useful for is comparing it to libinput. libinput got a custom trackpoint acceleration function in 1.8, designed around the hardware delta range. The idea was that you (or someone) measures the trackpoint device's range once, if it's outside of the assumed default ranges we add a hwdb entry and voila, it scales back to the right ranges and that device is fixed for good.

Except - this doesn't work. libinput scales into the delta range and calculates the factor from that, but it doesn't take the timestamps into account. It works on the assumption that trackpoint deltas arrive at a constant frequency with a varying delta. That is simply not the case, and the dynamic range of the trackpoint is so small that any acceleration of the deltas results in jerky movement.

This is of course fixable, we can just convert the deltas into a speed and then apply the acceleration curve based on that. So that's the next task, if you're interested in that, subscribe yourself to this issue.
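A minimal sketch of that idea, with a made-up curve loosely based on the numbers discussed above (this is not libinput's actual implementation; the constants are purely illustrative):

#include <stdint.h>

/* Sketch: convert a trackpoint delta plus the time since the previous event
 * into a speed in units/ms, then scale the delta by a factor from a simple
 * acceleration curve. */
static double accel_factor(double speed_units_per_ms)
{
        /* Decelerate below ~0.1 units/ms, constant factor up to ~0.4,
         * then ramp up - loosely mirroring the profile described above. */
        if (speed_units_per_ms < 0.1)
                return speed_units_per_ms / 0.1;        /* deceleration */
        if (speed_units_per_ms < 0.4)
                return 1.0;                             /* unaccelerated plateau */
        return 1.0 + (speed_units_per_ms - 0.4) * 2.0;  /* acceleration */
}

double accelerate_delta(int delta, uint64_t time_ms, uint64_t prev_time_ms)
{
        uint64_t dt = time_ms - prev_time_ms;
        double speed;

        if (dt == 0)
                dt = 1; /* guard against identical timestamps */

        speed = (double)(delta < 0 ? -delta : delta) / (double)dt; /* units/ms */
        return delta * accel_factor(speed);
}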

Portable Services with systemd v239

systemd v239 contains a great number of new features. One of them is first class support for Portable Services. In this blog story I'd like to shed some light on what they are and why they might be interesting for your application.

What are "Portable Services"?

The "Portable Service" concept takes inspiration from classic chroot() environments as well as container management and brings a number of their features to more regular system service management.

While the definition of what a "container" really is is hotly debated, I figure people can generally agree that the "container" concept primarily provides two major features:

  1. Resource bundling: a container generally brings its own file system tree along, bundling any shared libraries and other resources it might need along with the main service executables.

  2. Isolation and sand-boxing: a container operates in a name-spaced environment that is relatively detached from the host. Besides living in its own file system namespace it usually also has its own user database, process tree and so on. Access from the container to the host is limited with various security technologies.

Of these two concepts the first one is also what traditional UNIX chroot() environments are about.

Both resource bundling and isolation/sand-boxing are concepts systemd has implemented to varying degrees for a longer time. Specifically, RootDirectory= and RootImage= have been around for a long time, and so have been the various sand-boxing features systemd provides. The Portable Services concept builds on that, putting these features together in a new, integrated way to make them more accessible and usable.

OK, so what precisely is a "Portable Service"?

Much like a container image, a portable service on disk can be just a directory tree that contains service executables and all their dependencies, in a hierarchy resembling the normal Linux directory hierarchy. A portable service can also be a raw disk image, containing a file system containing such a tree (which can be mounted via a loop-back block device), or multiple file systems (in which case they need to follow the Discoverable Partitions Specification and be located within a GPT partition table). Regardless whether the portable service on disk is a simple directory tree or a raw disk image, let's call this concept the portable service image.

Such images can be generated with any tool typically used for the purpose of installing OSes inside some directory, for example dnf --installroot= or debootstrap. There are very few requirements made on these trees, except the following two:

  1. The tree should carry systemd unit files for relevant services in them.

  2. The tree should carry /usr/lib/os-release (or /etc/os-release) OS release information.

Of course, as you might notice, OS trees generated from any of today's big distributions generally qualify for these two requirements without any further modification, as pretty much all of them adopted /usr/lib/os-release and tend to ship their major services with systemd unit files.
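For reference, an os-release file can be as small as a handful of KEY=value lines; the values below are just an illustration:

ID=fedora
VERSION_ID=28
PRETTY_NAME="Fedora 28 (Twenty Eight)"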

A portable service image generated like this can be "attached" or "detached" from a host:

  1. "Attaching" an image to a host is done through the new portablectl attach command. This command dissects the image, reading the os-release information and searching for unit files in it. It then copies the relevant unit files out of the image and into /etc/systemd/system/. After that it augments each copied service unit file in two ways: a drop-in adding a RootDirectory= or RootImage= line is added, so that even though the unit files are now available on the host, when started they run the referenced binaries from the image. It also symlinks in a second drop-in called a "profile", which is supposed to carry additional security settings to enforce on the attached services, to ensure the right amount of sand-boxing.

  2. "Detaching" an image from the host is done through portablectl detach. It reverses the steps above: the unit files copied out are removed again, and so are the two drop-in files generated for them.

While a portable service is attached its relevant unit files are made available on the host like any others: they will appear in systemctl list-unit-files, you can enable and disable them, you can start them and stop them. You can extend them with systemctl edit. You can introspect them. You can apply resource management to them like to any other service, and you can process their logs like any other service and so on. That's because they really are native systemd services, except that they have a 'twist', if you so will: they have tougher security by default and store their resources in a root directory or image.
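To make this more concrete, the RootImage= drop-in from step 1 above might look roughly like the following (the unit name and path are illustrative, not taken from a real attach operation):

# /etc/systemd/system/example.service.d/20-portable.conf
[Service]
RootImage=/var/lib/portables/example_1.raw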

And that's already the essence of what Portable Services are.

A couple of interesting points:

  1. Even though the focus is on shipping service unit files in portable service images, you can actually ship timer units, socket units, target units, path units in portable services too. This means you can very naturally do time, socket and path based activation. It's also entirely fine to ship multiple service units in the same image, in case you have more complex applications.

  2. This concept introduces zero new metadata. Unit files are an existing concept, as are os-release files, and — in case you opt for raw disk images — GPT partition tables are already established too. This also means existing tools to generate images can be reused for building portable service images to a large degree as no completely new artifact types need to be generated.

  3. Because the Portable Service concept introduces zero new metadata and just builds on existing security and resource bundling features of systemd, it's implemented in a set of distinct tools, relatively disconnected from the rest of systemd. Specifically, the main user-facing command is portablectl, and the actual operations are implemented in systemd-portabled.service. If you so will, portable services are a true add-on to systemd, just making a specific work-flow nicer to use than with the basic operations systemd otherwise provides. Also note that systemd-portabled provides bus APIs accessible to any program that wants to interface with it; portablectl is just one tool that happens to be shipped along with systemd.

  4. Since Portable Services are a feature we only added very recently we wanted to keep some freedom to make changes still. Due to that we decided to install the portablectl command into /usr/lib/systemd/ for now, so that it does not appear in $PATH by default. This means, for now you have to invoke it with a full path: /usr/lib/systemd/portablectl. We expect to move it into /usr/bin/ very soon though, and make it a fully supported interface of systemd.

  5. You may wonder which unit files contained in a portable service image are the ones considered "relevant" and are actually copied out by the portablectl attach operation. Currently, this is derived from the image name. Let's say you have an image stored in a directory /var/lib/portables/foobar_4711/ (or alternatively in a raw image /var/lib/portables/foobar_4711.raw). In that case the unit files copied out match the pattern foobar*.service, foobar*.socket, foobar*.target, foobar*.path, foobar*.timer.

  6. The Portable Services concept does not define any specific method how images get on the deployment machines, that's entirely up to administrators. You can just scp them there, or wget them. You could even package them as RPMs and then deploy them with dnf if you feel adventurous.

  7. Portable service images can reside in any directory you like. However, if you place them in /var/lib/portables/ then portablectl will find them easily and can show you a list of images you can attach and suchlike.

  8. Attaching a portable service image can be done persistently, so that it remains attached on subsequent boots (which is the default), or it can be attached only until the next reboot, by passing --runtime to portablectl.

  9. Because portable service images are ultimately just regular OS images, it's natural and easy to build a single image that can be used in three different ways:

    1. It can be attached to any host as a portable service image.

    2. It can be booted as OS container, for example in a container manager like systemd-nspawn.

    3. It can be booted as host system, for example on bare metal or in a VM manager.

    Of course, to qualify for the latter two the image needs to contain more than just the service binaries, the os-release file and the unit files. To be bootable in an OS container manager such as systemd-nspawn, the image needs to contain an init system of some form, for example systemd. To be bootable on bare metal or in a VM it also needs a boot loader of some form, for example systemd-boot.

Profiles

In the previous section the "profile" concept was briefly mentioned. Since they are a major feature of the Portable Services concept, they deserve some focus. A "profile" is ultimately just a pre-defined drop-in file for unit files that are attached to a host. They are supposed to mostly contain sand-boxing and security settings, but may actually contain any other settings, too. When a portable service is attached a suitable profile has to be selected. If none is selected explicitly, the default profile called default is used. systemd ships with four different profiles out of the box:

  1. The default profile provides a medium level of security. It contains settings to drop capabilities, enforce system call filters, restrict many kernel interfaces and mount various file systems read-only.

  2. The strict profile is similar to the default profile, but generally uses the most restrictive sand-boxing settings. For example networking is turned off and access to AF_NETLINK sockets is prohibited.

  3. The trusted profile is the least strict of them all. In fact it makes almost no restrictions at all. A service run with this profile has basically full access to the host system.

  4. The nonetwork profile is mostly identical to default, but also turns off network access.

Note that the profile is selected at the time the portable service image is attached, and it applies to all service files attached, in case multiple are shipped in the same image. Thus, the sand-boxing restrictions to enforce are selected by the administrator attaching the image and not by the image vendor.

Additional profiles can be defined easily by the administrator, if needed. We might also add additional profiles sooner or later to be shipped with systemd out of the box.
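For example, attaching an image with the strict profile instead of the default one should look something like this (assuming portablectl's --profile= option and the /usr/lib/systemd/ location mentioned earlier; the image name is the one from the example above):

# /usr/lib/systemd/portablectl attach --profile=strict ./foobar_4711.raw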

What's the use-case for this? If I have containers, why should I bother?

Portable Services are primarily intended to cover use-cases where code should more feel like "extensions" to the host system rather than live in disconnected, separate worlds. The profile concept is supposed to be tunable to the exact right amount of integration or isolation needed for an application.

In the container world the concept of "super-privileged containers" has been touted a lot, i.e. containers that run with full privileges. It's precisely that use-case that portable services are intended for: extensions to the host OS, that default to isolation, but can optionally get as much access to the host as needed, and can naturally take benefit of the full functionality of the host. The concept should hence be useful for all kinds of low-level system software that isn't shipped with the OS itself but needs varying degrees of integration with it. Besides servers and appliances this should be particularly interesting for IoT and embedded devices.

Because portable services are just a relatively small extension to the way system services are otherwise managed, they can be treated like regular services for almost all use-cases: they will appear alongside regular services in all tools that can introspect systemd unit data, and can be managed the same way when it comes to logging, resource management, runtime life-cycles and so on.

Portable services are a very generic concept. While the original use-case is OS extensions, it's of course entirely up to you and other users to use them in a suitable way of your choice.

Walkthrough

Let's have a look how this all can be used. We'll start with building a portable service image from scratch, before we attach, enable and start it on a host.

Building a Portable Service image

As mentioned, you can use any tool you like that can create OS trees or raw images for building Portable Service images, for example debootstrap or dnf --installroot=. For this example walkthrough we'll use mkosi, which is ultimately just a fancy wrapper around dnf and debootstrap but makes a number of things particularly easy when repetitively building images from source trees.

I have pushed everything necessary to reproduce this walkthrough locally to a GitHub repository. Let's check it out:

$ git clone https://github.com/systemd/portable-walkthrough.git

Let's have a look in the repository:

  1. First of all, walkthroughd.c is the main source file of our little service. To keep things simple it's written in C, but it could be in any language of your choice. The daemon as implemented won't do much: it just starts up and waits for SIGTERM, at which point it will shut down. It's ultimately useless, but hopefully illustrates how this all fits together. The C code has no dependencies besides libc.

  2. walkthroughd.service is a systemd unit file that starts our little daemon. It's a simple service, hence the unit file is trivial (a minimal sketch of what such a unit can look like follows after this list).

  3. Makefile is a short make build script to build the daemon binary. It's pretty trivial, too: it just takes the C file and builds a binary from it. It can also install the daemon. It places the binary in /usr/local/lib/walkthroughd/walkthroughd (why not in /usr/local/bin? because it's not a user-facing binary but a system service binary), and its unit file in /usr/local/lib/systemd/walkthroughd.service. If you want to test the daemon on the host, you can now simply run make and then ./walkthroughd to check that everything works.

  4. mkosi.default is a file that tells mkosi how to build the image. We opt for a Fedora-based image here (but we might as well have used Debian, or any other supported distribution). We need no particular packages during runtime (after all we only depend on libc), but during the build phase we need gcc and make, hence these are the only packages we list in BuildPackages=.

  5. mkosi.build is a shell script that is invoked during mkosi's build logic. All it does is invoke make and make install to build and install our little daemon, and afterwards it extends the distribution-supplied /etc/os-release file with an additional field that describes our portable service a bit.
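As a sketch of what item 2 refers to, a minimal unit file for such a daemon could look roughly like this (the actual file in the repository may differ in detail):

[Unit]
Description=A simple example service

[Service]
ExecStart=/usr/local/lib/walkthroughd/walkthroughd

[Install]
WantedBy=multi-user.target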

Let's now use this to build the portable service image. For that we use the mkosi tool. It's sufficient to invoke it without parameters to build the first image: it will automatically discover mkosi.default and mkosi.build, which tell it what to do. (Note that if you work on a project like this for a longer time, mkosi -if is probably the better command to use, as it speeds up building substantially by using an incremental build mode). mkosi will download the necessary RPMs, and put them all together. It will build our little daemon inside the image and after all that's done it will output the resulting image: walkthroughd_1.raw.

Because we opted to build a GPT raw disk image in mkosi.default this file is actually a raw disk image containing a GPT partition table. You can use fdisk -l walkthroughd_1.raw to enumerate the partition table. You can also use systemd-nspawn -i walkthroughd_1.raw to explore the image quickly if you need.

Using the Portable Service Image

Now that we have a portable service image, let's see how we can attach, enable and start the service included within it.

First, let's attach the image:

# /usr/lib/systemd/portablectl attach ./walkthroughd_1.raw
(Matching unit files with prefix 'walkthroughd'.)
Created directory /etc/systemd/system/walkthroughd.service.d.
Written /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Created symlink /etc/systemd/system/walkthroughd.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system/walkthroughd.service.
Created symlink /etc/portables/walkthroughd_1.raw → /home/lennart/projects/portable-walkthrough/walkthroughd_1.raw.

The command will show you exactly what it has been doing: it just copied the main service file out, and added the two drop-ins, as expected.

Let's see if the unit is now available on the host, just like a regular unit, as promised:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: inactive (dead)

Nice, it worked. We see that the unit file is available and that systemd correctly discovered the two drop-ins. The unit is neither enabled nor started, however. Yes, attaching a portable service image doesn't imply enabling or starting it. It just means the unit files contained in the image are made available to the host. It's up to the administrator to then enable them (so that they are automatically started when needed, for example at boot), and/or start them (in case they shall run right away).

Let's now enable and start the service in one step:

# systemctl enable --now walkthroughd.service
Created symlink /etc/systemd/system/multi-user.target.wants/walkthroughd.service → /etc/systemd/system/walkthroughd.service.

Let's check if it's running:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: active (running) since Wed 2018-06-27 17:55:30 CEST; 4s ago
 Main PID: 45003 (walkthroughd)
    Tasks: 1 (limit: 4915)
   Memory: 4.3M
   CGroup: /system.slice/walkthroughd.service
           └─45003 /usr/local/lib/walkthroughd/walkthroughd

Jun 27 17:55:30 sigma walkthroughd[45003]: Initializing.

Perfect! We can see that the service is now enabled and running. The daemon is running as PID 45003.

Now that we verified that all is good, let's stop, disable and detach the service again:

# systemctl disable --now walkthroughd.service
Removed /etc/systemd/system/multi-user.target.wants/walkthroughd.service.
# /usr/lib/systemd/portablectl detach ./walkthroughd_1.raw
Removed /etc/systemd/system/walkthroughd.service.
Removed /etc/systemd/system/walkthroughd.service.d/10-profile.conf.
Removed /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Removed /etc/systemd/system/walkthroughd.service.d.
Removed /etc/portables/walkthroughd_1.raw.

And finally, let's see that it's really gone:

# systemctl status walkthroughd
Unit walkthroughd.service could not be found.

Perfect! It worked!

I hope the above gets you started with Portable Services. If you have further questions, please contact our mailing list.

Further Reading

A more low-level document explaining details is shipped along with systemd.

There are also relevant manual pages: portablectl(1) and systemd-portabled(8).

For further information about mkosi see its homepage.

June 25, 2018

For V3D, last week included:

  • Created the DRM fourcc for talking about the new UIF tiling between processes.
  • Fixed offsets of buffers shared between processes.
  • Reduced CPU overhead and binary size of V3D and VC4 release builds.
  • Fixed min/mag determination in non-mipmapped texture filtering modes.
  • Implemented ALPHA_TO_COVERAGE.
  • Fixed flushing of jobs writing to separate stencil buffers.
  • Fixed the return status of ClientWaitSync.
  • Fixed blits from linear winsys BOs.
  • Wrote a workaround for broken transform feedback setup with gallium NIR.

For VC4, I fixed a regression in display initialization in drm-misc-next. I’ve also been doing some study of the HVS to help Boris with the T-format scanout offset patch. Hopefully with what I figured out today, he can get it all working. I also respun my DSI enable/disable sequencing patch to not need any changes in the core.

Finally, I put together the branches for bcm2835 maintainership for kernel 4.19. This week I should PR them.

June 24, 2018


Configuring the QNAP device

The first step is to SSH into your QNAP device using the admin account and allow the container to access the TUN character device (major 10, minor 200 - the same numbers used by the mknod call further below).

ssh admin@NAS_IP
cat >> /share/Container/container-station-data/lib/lxc/CONTAINER_NAME/config << EOF
lxc.cgroup.devices.allow = c 10:200 rwm
EOF

Configuring the container guest

The second step is to open the QNAP web UI, open the Container Station application and enter the console of your container.

sed -i '/exit 0/d' /etc/rc.local
cat >> /etc/rc.local << EOF
if ! [ -c /dev/net/tun ]; then
    mkdir -p /dev/net
    mknod -m 666 /dev/net/tun c 10 200
fi

exit 0
EOF

Lastly the container needs to be restarted, and then your VPN application should be able to access TUN devices and work normally.
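If you want to double-check the device node inside the container after the restart, list it and compare the major/minor numbers with the ones configured above (10, 200):

ls -l /dev/net/tun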

June 22, 2018
In March 1986, my dad was in the market for a Thomson TO7/70. I have the circled classified ads in “Téo” issue 1 to prove that :)



TO7/70 with its chiclet keyboard and optical pen, courtesy of MO5.com

The “Plan Informatique pour Tous” was in full swing, and Thomson were supplying schools with micro-computers. My dad, as a primary school teacher, needed to know how to operate those computers, and eventually teach them to kids.

The first thing he showed us when he got the computer, on the living room TV, was a game called “Panic” or “Panique” where you controlled a missile, protecting a town from flying saucers that flew across the screen from either side, faster and faster as the game went on. I still haven't been able to locate this game again.

A couple of years later, the TO7/70 was replaced by a TO9, with a floppy disk, and my dad used that computer to write an educational software about top-down additions, as part of a training program run by the teachers schools (“Écoles Normales” renamed to “IUFM“ in 1990).

After months of nagging, and some spring cleaning, he found the listings of his educational software, which I've liberated, with his permission. I'm currently still working out how to generate floppy disks that are usable directly in emulators. But here's an early screenshot.


Later on, my dad got an IBM PC compatible, an Olivetti PC/1, on which I'd play a clone of Asteroids for hours, but that's another story. The TO9 got passed down to me, and after spending a full summer doing planning for my hot-dog and chips van business (I was 10 or 11, and I had weird hobbies already), and entering every game from the “102 Programmes pour...” series of books, the TO9 got put to the side at Christmas, replaced by a Sega Master System, using that same handy SCART connector on the Thomson monitor.

But how does this concern you? Well, I worked with RetroManCave on a Minitel episode not too long ago, and he agreed to do a history of the Thomson micro-computers. I did a fair bit of the research and fact-checking, as well as some needed repairs to the (prototype!) hardware I managed to find for the occasion. The result is this first look at the history of Thomson.



Finally, if you fancy diving into the Thomson computers, there will be an episode coming shortly about the MO5E hardware, and some games worth running on it, on the same YouTube channel.

I'm currently working on bringing the TeoTO8D emulator to Flathub, for Linux users. When that's ready, grab some games from the DCMOTO archival site, and have some fun!

I'll also be posting some nitty-gritty details about Thomson repairs on my Micro Repairs Twitter feed for the more technically inclined among you.

(Figure: schematic of the twistyplexed LED layout)

The above layout has N = 7, yielding 42 LEDs.

Apart from the symmetry being visually pleasing compared to the normal row & column Charlieplexing layouts, it's relatively easy to spot errors in the schematic.

Avoiding lookup tables

The major advantage of twistyplexing is the ability to avoid lookup tables and replace them with some relatively straightforward arithmetic.

row = led_number / (N - 1)
column = led_number % (N - 1)
anode = (row + column + 1) % N

Of course the cathode still has to be controlled, but its pin id is already given by the row variable above.
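As a small sketch of the arithmetic above (not from the original post; pin numbering is illustrative), the mapping can be written as a C helper:

#include <stdio.h>

#define N 7  /* number of pins, yielding N*(N-1) = 42 LEDs */

struct led_pins {
    int anode;   /* pin driven high */
    int cathode; /* pin driven low */
};

static struct led_pins twistyplex_map(int led_number)
{
    struct led_pins p;
    int row = led_number / (N - 1);
    int column = led_number % (N - 1);

    p.cathode = row;                  /* cathode pin is simply the row */
    p.anode = (row + column + 1) % N; /* anode pin from the formula above */
    return p;
}

int main(void)
{
    for (int led = 0; led < N * (N - 1); led++) {
        struct led_pins p = twistyplex_map(led);
        printf("LED %2d: anode pin %d, cathode pin %d\n", led, p.anode, p.cathode);
    }
    return 0;
}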

Thanks

The Twistyplexing concept was created by Tom Yu, and defined in this blog post.

June 19, 2018

Most of the past few weeks have been focused on improving the V3D driver’s dEQP/VK-GL-CTS conformance rates. The piglit infrastructure for the conformance tests is strange, and there doesn’t seem to be a way in tree to do an approximation of a proper conformance run, so I’ve had to glue together the deqp and khr_gles profiles, with some overrides for generating the khr_gles test list to be run (since we don’t seem to have a way to run a conformance mustpass list). There were lots of deqp failures when I started, due to not having run that suite before, but I’m down to a <.5% failure rate on simulation now.

On the testing front, thanks to Maxime we now have a vc4 testlist in the i-g-t suite. Hopefully the Raspberry Pi foundation can start using that to get some coverage of the vc4 driver when they update their kernel – as is, it seems my PRs mostly languish until Dom has time to hand-test them, which is absurd.

Boris has been working on supporting plane X/Y offsets (panning) with the T tiling format. He has it working on every other tile row, and I’m concerned that the HW may just be broken, since T scanout didn’t see a whole lot of testing as far as I know.

I wrote a patchset to core DRM to let drivers control the ordering of calling into bridges, which is important for DSI where you want to have the bridge prepare be before video packets are scanned out, but after the module is ready to send DSI transactions. This should let more DSI panels work with the Raspberry Pi, but the patch needs a rework after review.

June 15, 2018
After 8 years of doing graphics driver development in i915, i965, and misc for Intel, I have decided to take on other things. I will remain at Intel, and be working on… FreeBSD development. It’s a pretty vague gig at the moment, so I’d be happy to hear of areas in FreeBSD that you think […]
As I mentioned two weeks ago, I’ve transitioned into a new role at Intel. The team is very new and so a lot of my part right now is helping out in organizing the game plan. Last week I attended BSDCan 2018 as well as the FreeBSD dev summit. That trip in addition to feedback […]
June 13, 2018

This post does not describe a configuration system. If that's all you care about, read this post here and go be angry at someone else. Anyway, with that out of the way let's get started.

For a long time, libinput has supported model quirks (first added in Apr 2015). These model quirks are bitflags applied to some devices so we can enable special behaviours in the code. Model flags can be very specific ("this is a Lenovo x230 Touchpad") or generic ("This is a trackball") and it just depends on what the specific behaviour is that we need. The x230 touchpad for example has a custom pointer acceleration but trackballs are marked so they get some config options mice don't have/need.

In addition to model tags we also have custom attributes. These are free-form and provide information that we cannot get from the kernel. These too can be specific ("this model needs a pressure threshold of N") or generic ("bluetooth keyboards are external keyboards").

Overall, it's a good system. Most users never have to care that we even have this. The whole point is that any device-specific quirks need to be merged only once for each model, then everyone with the same device gets to benefit on the next update.

Originally quirks were hardcoded but this required rebuilding libinput for any changes. So we moved this to utilise the udev hwdb. For the trivial work of fetching udev properties we got a lot of flexibility in how we can match against devices. For example, an entry may look like this:


libinput:name:*AlpsPS/2 ALPS GlidePoint:dmi:*svnDellInc.:pnLatitudeE6220:*
LIBINPUT_ATTR_PRESSURE_RANGE=100:90
The above uses a name match and the dmi modalias match to apply a property for the touchpad on the Dell Latitude E6220. The exact match format is defined by a bunch of udev rules that ship as part of libinput.

Using the udev hwdb made the quirk storage a plaintext file that can be updated independently of libinput, including local overrides for testing things before merging them upstream. Having said that, it's definitely not public API and can change even between stable branch updates as properties are renamed or rescoped to fit the behaviour more accurately. For example, a model-specific tag may be renamed to a behaviour-specific tag as we find more devices affected by the same issue.

The main issue with the quirks now is that we keep accumulating more and more of them and I'm starting to hit limits with the udev hwdb match behaviour. The hwdb is great for single matches but not so great for cascading matches where one match may overwrite another match. The hwdb match system is largely implementation-defined so it's not always predictable which match rule wins out in the end.

Second, debugging the udev hwdb is not at all trivial. It's a bit like git - once you're used to it it's just fine but until then the air turns yellow with all the swearing being excreted by the unsuspecting user.

So long story short, libinput 1.12 will replace the hwdb model quirks database with a set of .ini files. The model quirks will be installed in /usr/share/libinput/ or whatever prefix your distribution prefers instead. It's a bunch of files with fairly simplistic instructions, each [section] has a set of MatchFoo=Bar directives and the ModelFoo=bar or AttrFoo=bar tags. See this file for an example. If all MatchFoo directives apply to a device, the Model and Attr tags are applied. Matching works in inter- and intra-file sequential order so the last section in a file overrides the first section of that file and the highest-sorting file overrides the lowest-sorting file. Otherwise the tags are accumulated, so if two files match on the same device with different tags, both tags are applied. So far, so unexciting.

Sometimes it's necessary to install a temporary local quirk until upstream libinput is updated or the distribution updates its package. For this, the /etc/libinput/local-overrides.quirks file is read in as well (if it exists). Note though that the config files are considered internal API, so any local overrides may stop working on the next libinput update. Should've upstreamed that quirk, eh?
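For illustration, a local override that mirrors the hwdb entry shown earlier might look roughly like this (the directive names here are illustrative; check the example file linked above for the exact spelling):

[Dell Latitude Touchpad Pressure Override]
MatchName=*AlpsPS/2 ALPS GlidePoint*
MatchDMIModalias=dmi:*svnDellInc.:pnLatitudeE6220:*
AttrPressureRange=100:90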

These files give us the same functionality as the hwdb - we can drop in extra files without recompiling. They're more human-readable than a hwdb match and it's a lot easier to add extra match conditions to it. And we can extend the file format at will. But the biggest advantage is that we can quite easily write debugging tools to figure out why something works or doesn't work. The libinput list-quirks tool shows what tags apply to a device and using the --verbose flag shows you all the files and sections and how they apply or don't apply to your device.

As usual, the libinput documentation has details.

June 12, 2018
Fingerprint readers are more and more common on Windows laptops, and hardware makers would really like to not have to make a separate SKU without the fingerprint reader just for Linux, if that fingerprint reader is unsupported there.

The original makers of those fingerprint readers just need to send patches to the libfprint Bugzilla, I hear you say, and the problem's solved!

But it turns out it's pretty difficult to write those new drivers, and those patches, without an insight on how the internals of libfprint work, and what all those internal, undocumented APIs mean.

Most of the drivers already present in libfprint are the results of reverse engineering, which means that none of them is a best-of-breed example of a driver, with all the unknown values and magic numbers.

Let's try to fix all this!

Step 1: fail faster

When you're writing a driver, the last thing you want is to have to wait for your compilation to fail. We ported libfprint to meson and shaved off a significant amount of time from a successful compilation. We also reduced the number of places where new drivers need to be declared to be added to the compilation.

Step 2: make it clearer

While doxygen is nice because it requires very little scaffolding to generate API documentation, the output is also not up to the level we expect. We ported the documentation to gtk-doc, which has a more readable page layout, easy support for cross-references, and gives us more control over how introductory paragraphs are laid out. See the before and after for yourselves.

Step 3: fail elsewhere

You created your patch locally, tested it out, and it's ready to go! But you don't know about git-bz, and you ended up attaching a patch file which you uploaded. Except you uploaded the wrong patch. Or the patch with the right name but from the wrong directory. Or you know git-bz but used the wrong commit id and uploaded another unrelated patch. This is all a bit too much.

We migrated our bugs and repository for both libfprint and fprintd to Freedesktop.org's GitLab. Merge Requests are automatically built, discussions are easier to follow!

Step 4: show it to me

Now that we have spiffy documentation, unified bug, patches and sources under one roof, we need to modernise our website. We used GitLab's CI/CD integration to generate our website from sources, including creating API documentation and listing supported devices from git master, to reduce the need to search the sources for that information.

Step 5: simplify

This process has started, but isn't finished yet. We're slowly splitting up the internal API between "internal internal" (what the library uses to work internally) and "internal for drivers" which we eventually hope to document to make writing drivers easier. This is partially done, but will need a lot more work in the coming months.

TL;DR: We migrated libfprint to meson, gtk-doc, GitLab, added a CI, and are writing docs for driver authors, everything's on the website!
June 07, 2018

So you have probably noticed by now that we started offering some 3rd party software in the latest Fedora Workstation, namely Google Chrome, Steam, the NVidia driver and PyCharm. This has come about due to a long discussion in the Fedora community on how we position Fedora Workstation and how we can improve our user experience. The principles we base this policy on can be read in this policy document. To sum it up, the idea is that while the Fedora operating system you install will continue, as it has for the last decade, to be based on only free software (with an exception for firmware), you will be able to more easily find and install the plethora of applications out there through our software store application, GNOME Software. We also expect that as the world of Linux software moves towards containers in general and Flatpaks specifically, we will have an increasing number of these 3rd party applications available in Fedora.

So the question I know some of you will have is: what does one actually have to do in order to get a 3rd party application listed in Fedora Workstation? Well, wonder no longer, as we have now put up a few documents outlining the steps you will need to take. Compared to traditional Linux packaging, the major difference is in the requirements around improved metadata for your application, so we cover that aspect in special detail. These documents cover both RPMs and Flatpaks.

First of all you can get a general overview from our 3rd Party guidelines document. This document also explains how you submit a request to the Fedora Workstation Working group for your application to be added.

Then, if you want to dig into the details of what metadata you need to create for your application, there is an in-depth metadata tutorial here, and finally, once you are ready to set up your repository, there is a tutorial explaining how to ensure your repository is able to provide the metadata you created above.

We expect to add more and more applications to Fedora Workstation over time here, and I would especially recommend that you look into Flatpaking your 3rd party application as it will decouple your application from the host operating system and thus decrease the workload on you maintaining your application for use in Fedora Workstation (and elsewhere).

This time we talk trackpoints. Or pointing sticks, or whatever else you want to call that thing between the GHB keys. If you don't have one and you've never seen one, prepare to be amazed. [1]

Trackpoints are tiny joysticks that react to pressure [2], convert that pressure into relative x/y events and pass that on to whoever is interested in it. The harder you push, the higher the deltas. This is where the simple and obvious stops and it gets difficult. But then again, if it was that easy I wouldn't write this post, you wouldn't have anything to read, so somehow everyone wins. Whoop-dee-doo.

All the data and measurements below refer to my trackpoint, a Lenovo T440s. It may not apply to any other trackpoints, including those on different laptop models or even on the same laptop model with different firmware versions. I've written the below with a lot of cringing and handwringing. I want to write data that is irrefutable, but the universe is against me and what the universe wants, the universe gets. Approximately every second sentence below has a footnote of "actual results may vary". Feel free to re-create the data on your device though.

Measuring trackpoint range is highly subjective, so you'll have to trust me when I describe how specific speeds/pressure ranges feel. There are three ranges of pressure on my trackpoint (sort-of):

  • Pressure range one: When resting the finger on the trackpoint I don't really need to apply noticeable pressure to make the trackpoint send events. Just moving the finger on the trackpoint makes it send events, albeit sporadically.
  • Pressure range two: Going beyond range one requires applying real pressure and feels to me like we're getting into RSI territory. Not a problem for short periods, but definitely not something I'd want all the time. It's the pressure I'd use to cross the screen.
  • Pressure range three: I have to push hard. I definitely wouldn't want to do this during everyday interaction and it just feels wrong anyway. This pressure range is for testing maximum deltas, not one you would want to use otherwise.
The first and second ranges are more easily delineated than the second and third because going from almost no pressure to some real pressure is easy. Going from some pressure to too much pressure is blurrier; there is some overlap between the second and third range. Either way, keep these ranges in mind as I'll be using them in the explanations below.

Ok, so with the physical conditions explained, let's look at what we have to worry about in software:

  • It is impossible to provide a constant input to a trackpoint if you're a puny human. Without a robotic setup you just cannot apply constant pressure, so any measurements have some error. You also get to enjoy a feedback loop - pressure influences pointer motion but that pointer motion influences how much pressure you inadvertently apply. This riddles any comparison with errors. I don't know if I'm applying the same pressure on the two devices I'm testing, and I don't know if a user I'm asking to test something uses constant/the same/the right pressure.
  • Not all trackpoints are created equal. Some trackpoints (mostly in Lenovos) have configurable sensitivity - 256 levels of it. [3] So one trackpoint measured does not equal another trackpoint unless you keep track of the firmware-set sensitivity. Those trackpoints also have other toggles. More importantly and AFAIK, this type of trackpoint also has a built-in acceleration curve. [4] Other trackpoints (ALPS) just have a fixed sensitivity, and I have no idea whether those have a built-in acceleration curve or merely a linear-ish pressure->delta mapping.

    Due to some design choices we made years ago, systemd increases the sensitivity on some devices (the POINTINGSTICK_SENSITIVITY property). So even on a vanilla install, you can't actually rely on the trackpoint being set to the manufacturer default. This was an attempt to make trackpoints behave more consistently; systemd had the hwdb and it seemed like the right place to put device-specific quirks. In hindsight, it was the wrong design choice.
  • Deltas are ... unreliable. At high sensitivity and high pressures you might get a sequence of [7, 7, 14, 8, 3, 7]. At lower pressure you get the deltas at seemingly random intervals. This could be because it's hard to keep exact constant pressure, or it could be a hardware issue.
  • evdev has been the default driver for almost a decade and before that it was the mouse driver for a long time. So the kernel will "Divide 4 since trackpoint's speed is too fast" [sic] for some trackpoints. Or by 8. Or not at all. In other words, the kernel adjusts for what the default user space is and userspace is based on what the kernel provides. On the newest ALPS trackpoints the kernel has stopped doing any in-kernel scaling (good!) but that means that the deltas are out by a factor of 8 now.
  • Trackpoints don't always have the same pressure ranges for x/y. AFAICT the y range is usually a bit less than the x range on many or most trackpoints. A bit weird because the finger position would suggest that strong vertical pressure is easier to apply than sideways pressure.
  • (Some? All?) Trackpoints have built-in calibration procedures to find and set their own center-point. Without that you'll get the trackpoint eventually being ever so slightly off center over time, causing a mouse pointer that just wanders off the screen, possibly into the woods, without the obligatory red cape and basket full of whatever grandma eats when she's sick.

    So the calibration is required but can be triggered accidentally by the user: If you push with the same pressure into the same direction for 2-5 seconds (depending on $THINGS) you trigger the calibration procedure and the current position becomes the new center point. When you release, the cursor wanders off for a few seconds until the calibration sets things straight again. If you ever see the cursor buzz off in a fixed direction or walking backwards for a centimetre or two you've triggered that calibration. The only way to avoid this is to make sure the pointer acceleration mechanism allows you to reach any target within 2 seconds and/or never forces you to apply constant pressure for more than 2 seconds. Now there's a challenge...

Ok. If you've been paying attention instead of hoping for a TLDR that's more elusive than Godot, we're now aware of the various drawbacks of collecting data from a trackpoint. Let's go and look at data. Sensitivity is set to the kernel default of 128 in sysfs, the default reporting rate is 100Hz. All observations are YMMV and whatnot, especially the latter.
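
For reference, that sensitivity value lives in a sysfs attribute on the trackpoint's serio device. Below is a minimal sketch for reading it; the exact path is an assumption and differs between machines (find yours with find /sys/devices/platform/i8042 -name sensitivity).

#include <stdio.h>

/* Minimal sketch: read the trackpoint sensitivity attribute from sysfs.
 * The path below is an assumption and differs between machines. */
int main(void)
{
    const char *path = "/sys/devices/platform/i8042/serio1/serio2/sensitivity";
    FILE *f = fopen(path, "r");
    int sensitivity = -1;

    if (!f)
        return 1;
    if (fscanf(f, "%d", &sensitivity) == 1)
        printf("sensitivity: %d (kernel default: 128)\n", sensitivity);
    fclose(f);
    return 0;
}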

Trackpoint deltas are in integers but the dynamic range of delta values is tiny. You mostly get 1 or 2 and it requires quite a fair bit of pressure to get up to 5 or more. At low pressure you get deltas of 1, but less frequently. Visualised, the relationship between deltas and the interval between deltas is like this:

At low pressure, we get deltas of 1 but high intervals. As the pressure increases, the interval between events shrinks until at some point the interval between events matches the reporting rate (100Hz/10ms). Increasing the pressure further now increases the deltas while the intervals remain at the reporting rate. For example, here's an event sequence at low pressure:

E: 63796.187226 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +20ms
E: 63796.227912 0002 0001 0001 # EV_REL / REL_Y 1
E: 63796.227912 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +40ms
E: 63796.277549 0002 0000 -001 # EV_REL / REL_X -1
E: 63796.277549 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +50ms
E: 63796.436793 0002 0000 -001 # EV_REL / REL_X -1
E: 63796.436793 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +159ms
E: 63796.546114 0002 0001 0001 # EV_REL / REL_Y 1
E: 63796.546114 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +110ms
E: 63796.606765 0002 0000 -001 # EV_REL / REL_X -1
E: 63796.606765 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +60ms
E: 63796.786510 0002 0000 -001 # EV_REL / REL_X -1
E: 63796.786510 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +180ms
E: 63796.885943 0002 0001 0001 # EV_REL / REL_Y 1
E: 63796.885943 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +99ms
E: 63796.956703 0002 0000 -001 # EV_REL / REL_X -1
E: 63796.956703 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +71ms
This was me pressing lightly but with perceived constant pressure, and the time stamps between events go from 20ms to 180ms. Remember what I said above about unreliable deltas? Yeah, that.

Here's an event sequence from a trackpoint at a pressure that triggers almost constant reporting:


E: 72743.926045 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.926045 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.926045 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 72743.939414 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.939414 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.939414 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +13ms
E: 72743.949159 0002 0000 -002 # EV_REL / REL_X -2
E: 72743.949159 0002 0001 -002 # EV_REL / REL_Y -2
E: 72743.949159 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 72743.956340 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.956340 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.956340 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +7ms
E: 72743.978602 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.978602 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.978602 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +22ms
E: 72743.989368 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.989368 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.989368 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +11ms
E: 72743.999342 0002 0000 -001 # EV_REL / REL_X -1
E: 72743.999342 0002 0001 -001 # EV_REL / REL_Y -1
E: 72743.999342 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 72744.009154 0002 0000 -001 # EV_REL / REL_X -1
E: 72744.009154 0002 0001 -001 # EV_REL / REL_Y -1
E: 72744.009154 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +10ms
E: 72744.018965 0002 0000 -002 # EV_REL / REL_X -2
E: 72744.018965 0002 0001 -003 # EV_REL / REL_Y -3
E: 72744.018965 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- +9ms
Note how there is an event in there with a 22ms interval? Maintaining constant pressure is hard. You can re-create the above recordings by running evemu-record.

Pressing hard I get deltas up to maybe 5. That's staying within the second pressure range outlined above, I can force higher deltas but what's the point. So the dynamic range for deltas alone is terrible - we have a grand total of 5 values across the comfortable range.
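
Because the deltas alone carry so little information, anything consuming these events has to fold the timestamps in as well. Here's a rough sketch of the kind of calculation involved (not libinput's actual code), assuming the REL_X/REL_Y values have already been summed per SYN_REPORT frame:

#include <math.h>
#include <stdint.h>

/* Rough sketch, not libinput's actual code: estimate the trackpoint speed
 * in device units per millisecond from two consecutive SYN_REPORT frames.
 * A delta of 1 every 100ms and a delta of 1 every 10ms are very different
 * speeds even though the delta value is identical. */
struct frame {
    int dx, dy;       /* summed REL_X/REL_Y for one SYN_REPORT frame */
    uint64_t time_us; /* frame timestamp in microseconds */
};

static double frame_speed(const struct frame *prev, const struct frame *cur)
{
    double dt_ms = (cur->time_us - prev->time_us) / 1000.0;

    if (dt_ms <= 0.0)
        return 0.0;

    return hypot(cur->dx, cur->dy) / dt_ms;
}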

Changing the sensitivity setting higher than the default will send higher deltas, including deltas greater than 1 before reaching the report rate. Setting it to lower than the default (does anyone do that?) sends smaller deltas. But doing so means changing the hardware properties, similar to how some gaming mice can switch dpi on the fly.

I leave you with a fun thought exercise in correlation vs. causation: your trackpoint uses PS/2, your touchpad probably uses PS/2. Your trackpoint has a reporting rate of 100Hz, but when you touch the touchpad half the bandwidth is used by the touchpad. So your trackpoint sends half the events when you have the palm resting on the touchpad. From my observations, the deltas don't double in size. In other words, your trackpoint just slows down to roughly half the speed. I can reduce the reporting rate to approximately a third by putting two or more fingers onto the touchpad. Trackpoints haven't changed that much over the years but touchpads have. So the takeaway is: 10 years ago touchpads were smaller and trackpoints were faster, simply because you could use them without touching the touchpad. Mind blown (if true, measuring these things is hard...)

Well, that was fun, wasn't it. I'm glad you stayed that long, because I did and it'd feel lonely otherwise. In the next post I'll outline the pointer acceleration curves for trackpoints and what we're going to do about them. Besides despairing, that is.

[1] I doubt you will be, but it always pays to be prepared.
[2] In this post I'm using "pressure" to mean sideways pressure, not downwards pressure. Some trackpoints can handle downwards pressure and modify the acceleration based on it (or expect userland to do so).
[3] Not that this number is always correct, the Lenovo CompactKeyboard USB with Trackpoint has a default sensibility of 5 - any laptop trackpoint would be unusable at that low value (their default is 128).
[4] I honestly don't know this for sure but ages ago I found a hw spec document that actually detailed the process. Search for "TrackPoint System Version 4.0 Engineering Specification", page 43, "2.6.2 DIGITAL TRANSFER FUNCTION".
June 06, 2018

Thanks to Daniel Stone's efforts, libinput is now on gitlab. For a longer explanation on the move from the old freedesktop infrastructure (cgit, bugzilla, etc.) to the gitlab instance hosted by freedesktop.org, see this email.

All open bugs have been migrated from bugzilla to gitlab too, the documentation has been updated accordingly, and we're ready to go. The new base URL for libinput in gitlab is: https://gitlab.freedesktop.org/libinput/.

June 04, 2018

The quiescent current of the APA102 2020 LEDs is about 0.7-1.0mA, which is rather high for some use cases. But recently some new options have been made available.

These are the APA102 1515 and the APA104 1515, both of which come in a 1.5x1.5mm package. Additionally, the APA104 1515 IC has a quiescent current of 0.3mA, which is a good step in the right direction.

APA102 1515 datasheet

APA104 1515 datasheet

Unfortunately only the APA104 datasheet specifies the quiescent current, so the quiescent current of the APA102 1515 is still unknown.
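
To put those numbers in perspective, here's a back-of-the-envelope sketch. The strip length and battery capacity are arbitrary assumptions; only the per-LED quiescent currents come from the figures above.

#include <stdio.h>

/* Back-of-the-envelope: standby drain of a strip from the per-LED quiescent
 * current. Strip length and battery capacity are arbitrary assumptions; the
 * per-LED currents are the figures quoted above. */
int main(void)
{
    const int leds = 60;
    const double battery_mah = 2000.0;
    const double iq_apa102_2020_ma = 1.0; /* upper end of the 0.7-1.0mA range */
    const double iq_apa104_1515_ma = 0.3; /* APA104 1515 datasheet */

    printf("APA102 2020: %4.0f mA standby, ~%.0f h on %.0f mAh\n",
           leds * iq_apa102_2020_ma,
           battery_mah / (leds * iq_apa102_2020_ma), battery_mah);
    printf("APA104 1515: %4.0f mA standby, ~%.0f h on %.0f mAh\n",
           leds * iq_apa104_1515_ma,
           battery_mah / (leds * iq_apa104_1515_ma), battery_mah);
    return 0;
}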

May 30, 2018

Early this month, I started upstreaming the prerequisite patches for Dave Stevenson’s MMAL V4L2 codec driver. The Kodi developers have tested that their pipeline works on the KMS backend using that driver, so once we get this upstream it should mean full media decode acceleration support for Kodi and gstreamer-based applications. All the prep patches were quickly accepted by gregkh, and now the MMAL camera driver will automatically probe at boot time when enabled.

The next week, I took a trip out to Cambridge to meet up with GNOME developers to work on performance issues on their software stack with lower-spec devices.

The primary task I adopted during the hackfest was building up support for capturing DRM events into sysprof. For anyone who hasn’t used it, sysprof is a beautiful tool for doing systemwide sampling profiling. Its maintainer, Christian Hergert, has been building up some timeline infrastructure in it, so you can see CPU utilization over time and look at samples within a chosen time range. By including some useful perf tracepoints in our captures too, we can see when vblank happens and when GPU jobs were running relative to it.

vc4 timeline

That’s the first cut of throwing the events from a Raspberry Pi running fullscreen glxgears into the UI – you’ve got the GPU’s binner on top, then the GPU’s renderer, and the vblank event at the bottom. These aren’t great yet, but for a couple of days of hacking it’s a good start. We should be able to do AMDGPU, V3D, and etnaviv timelines easily as well, but i915 unfortunately hides the necessary tracepoints under an off-by-default Kconfig option.

There exist other tools that capture GPU timelines already, like gpuvis. However, gpuvis seems to be composed of a bunch of shell scripts and commands on a wiki, and isn’t packaged on debian. Incorporating support for DRM events into sysprof means that anyone updating the package will get this code, complete with nice authentication for getting the privileges necessary to start profiling. While I was there, Mathias worked on having GTK provide tracepoints from some of its notable codepaths, and Jonas worked on having gnome-shell get tracepoints as well. Hopefully, all of these tracepoints glued together can give us some visualization of what went wrong across the system when we miss a vblank.

Ultimately, though, I’m skeptical of GNOME 3 ever being usable on a Raspberry Pi. The clutter-based gnome-shell painting is too slow (60% of a CPU burned in the shell just trying to present a single 60fps glxgears), and there doesn’t seem to be a plan for improving it other than “maybe we’ll delete clutter some day?” Also, the JavaScript extension system being in the compositor thread means that you drop application frames when something else (network state changes, notifications, etc) happens in the system. This was a bad software architecture choice, and digging out of that hole now would take a long time. (I’m agnostic on whether it was wrong to move those into the same process as the compositor, but the same thread was definitely wrong). I’ll keep working on the debugging tools to try to enable anyone to work on these problems, though.

After the hackfest, I stopped by the Raspberry Pi offices to talk multimedia again. Dave Stevenson’s going to take on upstreaming the v4l2 codec driver, having seen how easy the camera upstreaming to staging was. I respun the SAND display patches to get them upstreamed, so we can start talking about SAND with V4L2 soon. I also did some review of the core DRM writeback connector patches, so that hopefully we can push Boris’s VC4 support and be able to use that from things like HWC2 to do scene flattening.

While at the hotel, I tried out Stefan Schake’s vc4 fencing patches, wrote a couple of little fixes, and pushed the lot to Mesa.

For V3D, I wrote a patch series for supporting a bunch of GLES3 bitfield conversion operations on HW with no explicit instructions for them. Unfortunately, as the first one with hardware needing them, there hasn’t been any review yet. I also implemented glSampleMask/glSampleCoverage, and fixed some NaN bugs. Finally, I pushed the patches enabling the Mesa driver by default now that the kernel side has been accepted upstream!

After that busy couple of weeks, I took a week off at Portland’s regional burn, and I’m back to work now.

May 29, 2018

I am very happy to see that Benjamin Tissoires' work to enable the Dell Canvas and Totem has started to land in the upstream kernel. This work is the result of a collaboration between ourselves at Red Hat and Dell to bring this exciting device to Linux users.

Dell Canvas 27

Dell Canvas

The Dell Canvas and Totem are essentially a graphics tablet with a stylus, plus a turnable knob (the totem) that can be placed onto the graphics tablet. Dell features some videos on their site showcasing the Dell Canvas being used in areas such as drawing, video editing and CAD.

So for Linux applications that already support graphics drawing tablets, the Canvas should work once this lands, but where we hope to see application developers step up is adding support in their applications for the totem. I have been pondering how we could help make that happen, as we would be happy to donate a Dell Canvas to help kickstart application support; I am just unsure about the best way to go ahead.
I was considering offering one as a prize for the first application to add support for the totem, but that seems to be a chicken and egg problem by definition. Does anyone have suggestions for how to get one of these into the hands of the developer most interested in and able to take advantage of it?

libinput 1.11 is just around the corner and among the new features added are the libinput-record and libinput-replay tools. These are largely independent of libinput itself (libinput-replay is a Python script) and replace the evemu-record and evemu-replay tools. The functionality is roughly the same with a few handy new features. Note that this is a debugging tool; if you're "just" a user, you may never have to use either tool. But for any bug report expect me to ask for a libinput-record output, same as I currently ask everyone for an evemu recording.

So what does libinput-record do? Simple - it opens an fd to a kernel device node and reads events from it. These events are converted to YAML and printed to stdout (or the provided output file). The output is a combination of machine-readable information and human-readable comments. Included in the output are the various capabilities of the device but also some limited system information like the kernel version and the dmi modalias. The YAML file can be passed to libinput-replay, allowing me to re-create the event device on my test machines and hopefully reproduce the bug. That's about it. evemu did exactly the same thing and it has done wonders for how efficiently we could reproduce and fix bugs.
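
At its core, the recording side is just reading evdev events from a device node. A minimal sketch of that part (without the capability dump or the YAML conversion; the device node path is an example) could look like this:

#include <fcntl.h>
#include <linux/input.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch of the core of an event recorder: open an evdev node and
 * dump struct input_event as it arrives. libinput-record additionally
 * records device capabilities and system info and converts it all to YAML. */
int main(void)
{
    int fd = open("/dev/input/event0", O_RDONLY); /* example device node */
    struct input_event ev;

    if (fd < 0)
        return 1;

    while (read(fd, &ev, sizeof(ev)) == sizeof(ev))
        printf("%ld.%06ld type %u code %u value %d\n",
               (long)ev.time.tv_sec, (long)ev.time.tv_usec,
               ev.type, ev.code, ev.value);

    close(fd);
    return 0;
}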

Alas, evemu isn't perfect. It's approaching 8 years old now and its API is a bit crufty. Originally, two separate tools generated two separate files (machine-readable only), and two other tools created the device and replayed the events. Over the years it got more useful. Now we only have one tool each to record or replay events and the file includes human-readable comments. But we're hitting limits: its file format is very inflexible and the API is the same. So we'd have to add a new file format and the required parsing, break the API, deal with angry users, etc. Not worth it.

Thus libinput-record is the replacement for evemu. The main features that libinput-record adds are a more standardised file format that can be expanded and parsed easily, the ability to record and replay multiple devices at once, and the interleaving of evdev events with libinput events to check what's happening. And it's more secure by default: all alphanumeric keys are (by default) printed as KEY_A, so there's no risk of a password leaking into a file attached to Bugzilla. evemu required Python bindings; for libinput-record's output format we don't need those, since you can parse the YAML directly in Python. And finally - it's part of libinput, which means it's going to be easier to install (because distributions won't just ignore libinput) and it's going to be more up-to-date (because if you update libinput, you get the new libinput-record).

It's new code so it will take a while to iron out any leftover bugs but after that it'll be the glorious future ;)

May 20, 2018

The All Systems Go! 2018 Call for Participation is Now Open!

The Call for Participation (CFP) for All Systems Go! 2018 is now open. We’d like to invite you to submit your proposals for consideration to the CFP submission site.

ASG image

The CFP will close on July 30th. Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.

For more information please visit our conference website!

May 17, 2018

A textured cube with a reflection

The libre Midgard driver, Panfrost, has reached a number of new milestones, culminating in the above screenshot, demonstrating:

  • Textures! The bug preventing textures was finally cracked. I implemented support for textures in the compiler and through Gallium; I integrated some texture swizzling from limare’s source code; et voila, textures.

  • Multiple shaders! Previously, the Midgard, NIR-based compiler and the Gallium driver were separate; the compiler read GLSL from the disk, writing back compiled binaries to disk, and the driver would read these binaries. While this path is still used in the standalone test driver, the Gallium driver is now integrated with the compiler directly, enabling “online compilation”. Among numerous other benefits, multiple shaders can now be used in the same program.

  • The stencil test in the Gallium driver. The scissor test could have been used instead, but the stencil test can generalise to non-rectangular surfaces. Additionally, the mechanics of the stencil test on Midgard hardware are better understood than those of the scissor test for the time being.

  • Blending (partial support), again through Mesa. Currently, only fixed-function blending is supported; implementing “blend shaders” is low-priority due to their rarity, complexity, and performance.

  • My love for My Little Pony. Screenshot is CC BY-SA, a derivative of PonyToast’s photograph of the Element of Generosity’s voice actress.

Warning: the following is ruthlessly technical and contains My Little Pony references. Proceed at your own risk.

Textures are by far the most significant addition. Although work decoding their command stream and shader instructions had commenced months ago, we hadn’t managed to replay a texture until May 3, let alone implement support in the driver. The lack of functional textures was the only remaining showstopper. We had poured in long hours debugging it, narrowing down the problem to the command stream, but nothing budged. No permutation of the texture descriptor or the sampler descriptor changed the situation. Yet, everyone was sure that once we figured it out, it would have been something silly in hindsight.

It was.

OpenGL’s textures in the command stream are controlled by the texture and sampler descriptors, corresponding to Vulkan’s textures and samplers respectively. They were the obvious place to look for bugs.

They were not the culprit.

Where did the blame lie, then?

The shader descriptor.

Midgard’s shader descriptor, a block in the command stream which configures a shader, has a number of fields: the address of the compiled shader binary, the number of registers used, the number of attributes/varyings/uniforms, and so forth. I thought that was it. A shader descriptor from a replay with textures looked like this (reformatted for clarity):

struct mali_shader_meta shader_meta = {
    .shader = (shader_memory + 1920) | 5,
    // XXX shader zero tripped
    .attribute_count = 0,
    .varying_count = 1,
    .uniform_registers = (0 << 20) | 0x20e00,
};

That is, the shader code is at shader_memory + 1920; as a fragment shader, it uses no attributes but it does receive a single varying; it does not use any uniforms. All accounted for, right?

What’s that comment about, “XXX shader zero tripped”, then?

There are frequently fields in the command stream that we observe to always be zero, for various reasons. Sometimes they are there for padding and alignment. Sometimes they correspond to a feature that none of our tests had used yet. In any event, it is distracting for a command stream log to be filled with lines like:

.zero0 = 0,
.zero1 = 0,
.zero2 = 0,
.zero3 = 0,

In an effort to keep everything tidy, fields that were observed to always be zero are not printed. Instead, the tracer just makes sure that the unprinted fields (which default to zero by the C compiler) are, in fact, equal to zero. If they are not, a warning is printed, stating that a “zero is tripped”, as if the field were a trap. When the reader of the log sees this line, they know that the replay is incomplete, as they are missing a value somewhere; a field was wrongly marked as “always zero”. It was a perfect system.

At least, it would have been a perfect system, if I had noticed the warning.

I was hyper-focused on the new texture and sampler descriptors, on the memory allocations for the texture memory itself, on the shader binaries – I was so hyper-focused on textures that I only skimmed the rest of the log for anomalies.

If I had – when I finally did, on that fateful Thursday – I would have realised that the zero was tripped. I would have committed a change like:

-               if (t->zero1)
-                       panwrap_msg("XXX shader zero tripped\n");
+               //if (t->zero1)
+               //      panwrap_msg("XXX shader zero tripped\n");
 
+               panwrap_prop("zero1 = %" PRId16, t->zero1);
        panwrap_prop("attribute_count = %" PRId16, t->attribute_count);
        panwrap_prop("varying_count = %" PRId16, t->varying_count);

I would have then discovered that “zero1” was mysteriously equal to 65537 for my sample with a texture. And I would have noticed that suddenly, texture replay worked!

Everything fell into place from then. Notice that 65537 in decimal is equal to 0x10001 in hex. With some spacing included for clarity, that’s 0x 0001 0001. Alternatively, instead of a single 32-bit word, it can be interpreted as two 16-bit integers: two ones in succession. What two things do we have one of in the command stream?

Textures and samplers!

Easy enough to handle in the command stream:

    mali_ptr shader;
-       u32 zero1;
+
+       u16 texture_count; 
+       u16 sampler_count;
 
    /* Counted as number of address slots (i.e. half-precision vec4's) */
    u16 attribute_count;

After that, it was just a matter of moving code from the replay into the real driver, writing functions to translate Gallium commands into Midgard structures, implementing a routine in the compiler to translate NIR instructions to Midgard instructions, and a lot of debugging. A week later, all the core code for textures was in place… almost.

The other big problem posed by textures is their internal format. In some graphics systems, textures are linear, the most intuitive format; that is, a pixel is accessed in the texture by texture[y*stride + x]. However, for reasons of cache locality, this format is a disaster for a GPU; instead, textures are stored “tiled” or “swizzled”. This article offers a good overview of tiled texture layouts.
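
To illustrate the general idea only, here is a toy linear-to-4x4-tile conversion. This is an assumption for illustration, not the actual Midgard layout that limare implements.

#include <stdint.h>

/* Toy example of texture tiling: copy a linear RGBA buffer into contiguous
 * 4x4 pixel tiles. This is NOT the Midgard swizzle, just an illustration of
 * why tiled layouts differ from texture[y * stride + x]. Assumes width and
 * height are multiples of 4. */
static void tile_4x4(const uint32_t *linear, uint32_t *tiled,
                     unsigned width, unsigned height)
{
    unsigned tiles_per_row = width / 4;

    for (unsigned y = 0; y < height; y++) {
        for (unsigned x = 0; x < width; x++) {
            unsigned tile_index = (y / 4) * tiles_per_row + (x / 4);
            unsigned in_tile = (y % 4) * 4 + (x % 4);

            tiled[tile_index * 16 + in_tile] = linear[y * width + x];
        }
    }
}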

Texture tiling is great and powerful for hardware. It is less great and powerful for driver writers. Decoding the swizzling algorithm would have been a mammoth task, orthogonal to the command stream and shader work for textures. 3D drivers are complex – textures have three major components that are each orthogonal to each other.

It would have been hopeless… if libv had not already decoded the layout when writing limare! The heavy lifting was done, all released under the MIT license. In an afternoon’s work, I extracted the relevant code from limare, cleaned it up a bit, and made it about 20% faster (abstract rounding). The resulting algorithm is still somewhat opaque, but it works! In a single thread on my armv7 RK3288 laptop, about 355 RGBA32 1080p textures can be swizzled in 10 seconds flat.

I then integrated the swizzling code with the Gallium driver, and voilà, really– no, no, this time it’s true – I’m not lying! – er, well, I had to finish some other tasks before I could demo test-cube-textured, but…. voilà! (Sorry for speaking Prancy.)

Textures?

Textures!


On the Bifrost side, Lyude Paul has continued her work writing an assembler. The parser, a surprisingly complex task given the nuances of the ISA, is now working reliably. Code emission is in nascent stages, and her assembler is now making progress on instruction encoding. The first instructions have almost been emitted. May many more instructions follow.

However, an assembler for Bifrost is no good without a free driver to use it with; accordingly, Connor Abbott has continued his work investigating the Bifrost command stream. It continues to demonstrate considerable similarities to Midgard; luckily, much of the driver code will be shareable between the architectures. Like the assembler, this work is still in early stages, implemented in a personal branch, but early results look promising.

And a little birdy told me that there might be T880 support in the pipes.

May 14, 2018

This post is part of a four part series: Part 1, Part 2, Part 3, Part 4.

In the first three parts, I covered the X server and synaptics pointer acceleration curves and how libinput compares to the X server pointer acceleration curve. In this post, I will compare libinput to the synaptics acceleration curve.

Comparison of synaptics and libinput

libinput has multiple different pointer acceleration curves, depending on the device. In this post, I will only consider the one used for touchpads. So let's compare the synaptics curve with the libinput curve at the default configurations:

But this one doesn't tell the whole story, because the touchpad accel for libinput actually changes once we get faster. So here are the same two curves, but this time with the range up to 1000mm/s. These two graphs show that libinput is both very different from and very similar to synaptics. Both curves have an acceleration factor less than 1 for the majority of speeds; they both decelerate the touchpad more than accelerating it. synaptics has two factors it sticks to and a short curve; libinput has a short deceleration curve and its plateau is the same or lower than synaptics for the most part. Once the threshold is hit at around 250 mm/s, libinput's acceleration keeps increasing until it hits a maximum much later.

So, for anything under ~20mm/s, libinput should be the same as synaptics (ignoring the <7mm/s deceleration). For anything less than 250mm/s, libinput should be slower. I say "should be" because that is not actually the case: synaptics is slower, so I suspect the server scaling slows down synaptics even further. Hacking around in the libinput code, I found that moving libinput's baseline to 0.2 matches the synaptics cursor's speed. However, AFAIK that scaling depends on the screen size, so your mileage may vary.

Comparing configuration settings

Let's overlay the libinput speed toggles. In Part 2 we've seen the synaptics toggles, and they're open-ended, so it's a bit hard to pick a specific set to compare against. I'll be using the same combined configuration options from the diagram there.

And we need the diagram from 0-1000mm/s as well. There isn't much I can talk about here in direct comparison, the curves are quite different and the synaptics curves vary greatly with the configuration options (even though the shape remains the same).

Analysis

It's fairly obvious that the acceleration profiles are very different once we depart from the default settings. Most notably, only libinput's slowest speed setting matches the 0.2 speed that is the synaptics default setting. In other words, if your touchpad is too fast compared to synaptics, it may not be possible to slow it down sufficiently. Likewise, even at the fastest speed, the baseline is well below the synaptics baseline for e.g. 0.6 [1], so if your touchpad is too slow, you may not be able to speed it up sufficiently (at least for low speeds). That problem won't exist for the maximum acceleration factor, the main question here is simply whether they are too high. Answer: I don't know.

So the base speed of the touchpad in libinput needs a higher range, that's IMO a definitive bug that I need to work on. The rest... I don't know. Let's see how we go.

[1] A configuration I found suggested in some forum when googling for MinSpeed, so let's assume there's at least one person out there using it.

This post is part of a four part series: Part 1, Part 2, Part 3, Part 4.

In Part 1 and Part 2 I showed the X server acceleration code as used by the evdev and synaptics drivers. In this part, I'll show how it compares against libinput.

Comparison to libinput

libinput has multiple different pointer acceleration curves, depending on the device. In this post, I will only consider the default one used for mice. A discussion of the touchpad acceleration curve comes later. So, back to the graph of the simple profile. Let's overlay this with the libinput pointer acceleration curve:

Turns out the pointer acceleration curve, mostly modeled after the xserver behaviour, roughly matches the xserver behaviour. Note that libinput normalizes to 1000dpi (provided MOUSE_DPI is set correctly) and thus the curves only match this way for 1000dpi devices.

libinput's deceleration is slightly different but I doubt it is really noticeable. The plateau of no acceleration is virtually identical, i.e. at slow speeds libinput moves like the xserver's pointer does. Likewise for speeds above ~33mm/s, libinput and the server accelerate by the same amount. The actual curve is slightly different. It is a linear curve (I doubt that's noticeable) and it doesn't have that jump in it. The xserver acceleration maxes out at roughly 20mm/s. The only difference in acceleration is for the range of 10mm/s to 33mm/s.

30mm/s is still a relatively slow movement (just move your mouse by 30mm within a second, it doesn't feel fast). This means that for all but slow movements, the current server and libinput acceleration provide but a flat acceleration at whatever the maximum acceleration is set to.

Comparison of configuration options

The biggest difference libinput has to the X server is that it exposes a single knob of normalised continuous configuration (-1.0 == slowest, 1.0 == fastest). It relies on settings like MOUSE_DPI to provide enough information to map a device into that normalised range.
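
For reference, this is roughly what that knob looks like through libinput's configuration API; a minimal sketch, assuming device is a struct libinput_device obtained from an existing libinput context:

#include <libinput.h>

/* Minimal sketch: set the normalised speed, assuming `device` comes from an
 * existing libinput context. speed is in the [-1.0, 1.0] range described
 * above. */
static void set_pointer_speed(struct libinput_device *device, double speed)
{
    if (!libinput_device_config_accel_is_available(device))
        return;

    libinput_device_config_accel_set_speed(device, speed);
}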

Let's look at the libinput speed settings and their effect on the acceleration profile (libinput 1.10.x).

libinput's speed setting is a combination of changing thresholds and accel at the same time. The faster you go, the sooner acceleration applies and the higher the maximum acceleration is. For very slow speeds, libinput provides deceleration. Noticeable here though is that the baseline speed is the same until we get to speed settings of less than -0.5 (where we have an effectively flat profile anyway). So up to the (speed-dependent) threshold, the mouse speed is always the same.

Let's look at the comparison of libinput's speed setting to the accel setting in the simple profile:

Clearly obvious: libinput's range is a lot smaller than what the accel setting allows (that one is effectively unbounded). This obviously applies to the deceleration as well. I'm not posting the threshold comparison, as Part 1 shows it does not affect the maximum acceleration factor anyway.

Analysis

So, where does this leave us? I honestly don't know. The curves are different but the only paper I could find on comparing acceleration curves is Casiez and Roussel's 2011 UIST paper. It provides a comparison of the X server acceleration with the Windows and OS X acceleration curves [1]. It shows quite a difference between the three systems but the authors note that no specific acceleration curve is definitely superior. However, the most interesting bit here is that both the Windows and the OS X curve seem to be constant acceleration (with very minor changes) rather than changing the curve shape.

Either way, there is one possible solution for libinput to implement: to change the base plateau with the speed. Otherwise libinput's acceleration curve is well defined for the configurable range. And a maximum acceleration factor of 3.5 is plenty for a properly configured mouse (generally anything above 3 is tricky to control). AFAICT, the main issues with pointer acceleration come from mice that either don't have MOUSE_DPI set or trackpoints which are, unfortunately, a completely different problem.

I'll probably also give the Windows/OS X approaches a try (i.e. same curve, different constant deceleration) and see how that goes. If it works well, that may be a solution because it's easier to scale into a large range. Otherwise, *shrug*, someone will have to come up with a better solution.

[1] I've never been able to reproduce the same gain (== factor) but at least the shape and x axis seems to match.

This post is part of a four part series: Part 1, Part 2, Part 3, Part 4.

In Part 1 I showed the X server acceleration code as used by the evdev driver (which leaves all acceleration up to the server). In this part, I'll show the acceleration code as used by the synaptics touchpad driver. This driver installs a device-specific acceleration profile but beyond that the acceleration is... difficult. The profile itself is not necessarily indicative of the real movement, the coordinates are scaled between device-relative, device-absolute, screen-relative, etc. so often that it's hard to keep track of what the real delta is. So let's look at the profile only.

Diagram generation

Diagrams were generated by gnuplot, parsing .dat files generated by the ptrveloc tool in the git repo. Helper scripts to regenerate all data are in the repo too. Default values unless otherwise specified:

  • MinSpeed: 0.4
  • MaxSpeed: 0.7
  • AccelFactor: 0.04
  • dpi: 1000 (used for converting units to mm)
All diagrams are limited to 100 mm/s and a factor of 5 so they are directly comparable. From earlier testing I found movements over 300 mm/s are rare; once you hit 500 mm/s the acceleration doesn't really matter that much anymore, you're going to hit the screen edge anyway.

The choice of 1000 dpi is a difficult one. It makes the diagrams directly comparable to those in Part 1, but touchpads have a great variety in their resolution. For example, an ALPS DualPoint touchpad may have resolutions of 25-32 units/mm. A Lenovo T440s has a resolution of 42 units/mm over PS/2 but 20 units/mm over the newer SMBus/RMI4 protocol. This is the same touchpad. Overall it doesn't actually matter that much though, see below.

The acceleration profile

This driver has a custom acceleration profile, configured by the MinSpeed, MaxSpeed and AccelFactor options. The former two put a cap on the factor but MinSpeed also adjusts (overwrites) ConstantDeceleration. The AccelFactor defaults to a device-specific size based on the device diagonal.

Let's look at the defaults of 0.4/0.7 for min/max and 0.04 (default on my touchpad) for the accel factor:

The simple profile from part 1 is shown in this graph for comparison. The synaptics profile is printed as two curves, one for the profile output value and one for the real value used on the delta. Unlike the simple profile you cannot configure ConstantDeceleration separately, it depends on MinSpeed. Thus the real acceleration factor is always less than 1, so the synaptics driver doesn't accelerate as such, it controls how much the deltas are decelerated.

The actual acceleration curve is just a plain old linear interpolation between the min and max acceleration values. If you look at the curves closer you'll find that there is no acceleration up to 20mm/s and flat acceleration from 25mm/s onwards. Only in this small speed range does the driver adjust its acceleration based on input speed. Whether this is intentional or just happened, I don't know.
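
A sketch of that shape (not the literal synaptics code, just the ramp-and-clamp behaviour described above) looks like this:

/* Sketch of the shape described above, not the literal synaptics code: a
 * linear ramp driven by AccelFactor, clamped between the MinSpeed and
 * MaxSpeed values (0.4 and 0.7 by default). */
static double synaptics_like_profile(double speed, double min_factor,
                                     double max_factor, double accel_factor)
{
    double factor = speed * accel_factor;

    if (factor < min_factor)
        return min_factor;
    if (factor > max_factor)
        return max_factor;

    return factor;
}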

The accel factor depends on the touchpad x/y axis. On my T440s using PS/2, the factor defaults to 0.04. If I get it to use SMBus/RMI4 instead of PS/2, that same device has an accel factor of 0.09. An ALPS touchpad may have a factor of 0.13, based on the min/max values for the x/y axes. These devices all have different resolutions though, so here are the comparison graphs taking the axis range and the resolution into account:

The diagonal affects the accel factor, so these three touchpads (two curves are the same physical touchpad, just using a different bus) get slightly different acceleration curves. They're more similar than I expected though, and for the rest of this post we can get away with just looking at the 0.04 default value from my touchpad.

Note that due to how synaptics is handled in the server, this isn't the whole story, there is more coordinate scaling etc. happening after the acceleration code. The synaptics acceleration profile also does not accommodate for uneven x/y resolutions; this is handled in the server afterwards. On touchpads with uneven resolutions the velocity thus depends on the vector: moving along the x axis provides differently sized deltas than moving along the y axis. However, anything applied later isn't speed dependent but merely a constant scale, so these curves are still a good representation of what happens.

The effect of configurations

What does the acceleration factor do? It changes when acceleration kicks in and how steep the acceleration is.

And how do the min/max values play together? Let's adjust MinSpeed but leave MaxSpeed at 0.7.

MinSpeed lifts the baseline (i.e. the minimum acceleration factor), somewhat expected from a parameter named this way. But it looks again like we have a bug here. When MinSpeed and MaxSpeed are close together, our acceleration actually decreases once we're past the threshold. So counterintuitively, a higher MinSpeed can result in a slower cursor once you move faster.

MaxSpeed is not too different here:

The same bug is present, if the MaxSpeed is smaller or close to MinSpeed, our acceleration actually goes down. A quick check of the sources didn't indicate anything enforcing MinSpeed < MaxSpeed either. But otherwise MaxSpeed lifts the maximum acceleration factor.

These graphs look at the options in separation, in reality users would likely configure both MinSpeed and MaxSpeed at the same time. Since both have an immediate effect on pointer movement, trial and error configuration is simple and straightforward. Below is a graph of all three adjusted semi-randomly:

No surprises in there, the baseline (and thus slowest speed) changes, the maximum acceleration changes and how long it takes to get there changes. The curves vary quite a bit though, so without knowing the configuration options, it's impossible to predict how a specific touchpad behaves.

Epilogue

The graphs above show the effect of configuration options in the synaptics driver. I purposely didn't put any specific analysis in and/or compare it to libinput. That comes in a future post.

This post is part of a four part series: Part 1, Part 2, Part 3, Part 4.

Over the last few days, I once again tried to tackle pointer acceleration. After all, I still get plenty of complaints about how terrible libinput is and how the world was so much better without it. So I once more tried to understand the X server's pointer acceleration code. Note: the evdev driver doesn't do any acceleration, it's all handled in the server. Synaptics will come in part two, so this here focuses mostly on pointer acceleration for mice/trackpoints.

After a few failed attempts of live analysis [1], I finally succeeded extracting the pointer acceleration code into something that could be visualised. That helped me a great deal in going back and understanding the various bits and how they fit together.

The approach was: copy the ptrveloc.(c|h) files into a new project, set up a meson.build file, #define all the bits that are assumed to be there and voila, here's your library. Now we can build basic analysis tools provided we initialise all the structs the pointer accel code needs correctly. I think I succeeded. The git repo is here if anyone wants to check the data. All scripts to generate the data files are in the repository.

A note on language: the terms "speed" and "velocity" are subtly different but for this post the difference doesn't matter. The code uses "velocity" but "speed" is more natural to talk about, so just assume equivalence.

The X server acceleration code

There are 15 configuration options for pointer acceleration (ConstantDeceleration, AdaptiveDeceleration, AccelerationProfile, ExpectedRate, VelocityTrackerCount, Softening, VelocityScale, VelocityReset, VelocityInitialRange, VelocityRelDiff, VelocityAbsDiff, AccelerationProfileAveraging, AccelerationNumerator, AccelerationDenominator, AccelerationThreshold). Basically, every number is exposed as configurable knob. The acceleration code is a product of a time when we were handing out configuration options like participation medals at a children's footy tournament. Assume that for the rest of this blog post, every behavioural description ends with "unless specific configuration combinations apply". In reality, I think only four options are commonly used: AccelerationNumerator, AccelerationDenominator, AccelerationThreshold, and ConstantDeceleration. These four have immediate effects on the pointer movement and thus it's easy to do trial-and-error configuration.

The server has different acceleration profiles (called the 'pointer transfer function' in the literature). Each profile is a function that converts speed into a factor. That factor is then combined with other things like constant deceleration, but eventually our output delta forms as:


delta_out(x, y) = delta_in(x, y) * factor * deceleration
The output delta is passed back to the server and the pointer saunters over by a few pixels, happily bumping into any screen edge on the way.

The input for the acceleration profile is a speed in mickeys, a threshold (in mickeys) and a max accel factor (unitless). Mickeys are a bit tricky: a mickey is one device unit of movement, so its physical size depends on the device resolution. This means the acceleration is device-specific; the deltas for a mouse at 1000 dpi are 25% larger than the deltas for a mouse at 800 dpi (assuming the same physical distance and speed). The "Resolution" option in evdev can work around this, but by default this means that the acceleration factor is (on average) higher for high-resolution mice for the same physical movement. It also means that that xorg.conf snippet you found on stackoverflow probably does not do the same on your device.

The second problem with mickeys is that they require a frequency to map to a physical speed. If a device sends events every N ms, delta/N gives us a speed in units/ms. But we need mickeys for the profiles. Devices generally have a fixed reporting rate and the speed in mickeys is (units/ms * reporting rate). This rate defaults to 10 in the server (the VelocityScaling default value) and thus matches a device reporting at 100Hz (a discussion of this comes later). All graphs below were generated with this default value.
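
Putting that together, the conversion into mickeys looks roughly like this (a sketch of the relationship described above, not the server's literal code):

/* Sketch of the relationship described above, not the server's code: the
 * velocity in device units per millisecond, multiplied by the velocity
 * scale (default 10, i.e. assuming a 100Hz device), gives the mickeys fed
 * into the acceleration profile. */
static double mickeys(double delta_units, double dt_ms, double velocity_scale)
{
    if (dt_ms <= 0.0)
        return 0.0;

    return (delta_units / dt_ms) * velocity_scale;
}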

Back to the profile function and how it works: The threshold (usually) defines the minimum speed at which acceleration kicks in. The max accel factor (usually) limits the acceleration. So the simplest algorithm is


if (velocity < threshold)
    return base_velocity;

factor = calculate_factor(velocity);
if (factor > max_accel)
    return max_accel;

return factor;
In reality, things are somewhere between this simple and "whoops, what have we done".

Diagram generation

Diagrams were generated by gnuplot, parsing .dat files generated by the ptrveloc tool in the git repo. Helper scripts to regenerate all data are in the repo too. Default values unless otherwise specified:

  • threshold: 4
  • accel: 2
  • dpi: 1000 (used for converting units to mm)
  • constant deceleration: 1
  • profile: classic
All diagrams are limited to 100 mm/s and a factor of 5 so they are directly comparable. From earlier testing I found movements over 300 mm/s are rare; once you hit 500 mm/s the acceleration doesn't really matter that much anymore, you're going to hit the screen edge anyway.

Acceleration profiles

The server provides a number of profiles, but I have seen very little evidence that people use anything but the default "Classic" profile. Synaptics installs a device-specific profile. Below is a comparison of the profiles just so you get a rough idea what each profile does. For this post, I'll focus on the default Classic only.

First thing to point out here is that if you want to have your pointer travel to Mars, the linear profile is what you should choose. This profile is unusable without further configuration to bring the incline to a more sensible level. Only the simple and limited profiles have a maximum factor, all others increase acceleration indefinitely. The faster you go, the more they accelerate the movement. I find them completely unusable at anything but low speeds.

The classic profile transparently maps to the simple profile, so the curves are identical.

Anyway, as said above, profile changes are rare. The one we care about is the default profile: the classic profile which transparently maps to the simple profile (SimpleSmoothProfile() in the source).

Looks like there's a bug in the profile formula. At the threshold value it jumps from 1 to 1.5 before the curve kicks in. This code was added in ~2008, apparently no-one noticed this in a decade.

The profile has deceleration (accel factor < 1 and thus decreasing the deltas) at slow speeds. This provides extra precision at slow speeds without compromising pointer speed at higher physical speeds.

The effect of config options

Ok, now let's look at the classic profile and the configuration options. What happens when we change the threshold?

First thing that sticks out: one of these is not like the others. The classic profile changes to the polynomial profile at thresholds less than 1.0. *shrug* I think there's some historical reason, I didn't chase it up.

Otherwise, the threshold not only defines when acceleration starts kicking in but it also affects steepness of the curve. So higher threshold also means acceleration kicks in slower as the speed increases. It has no effect on the low-speed deceleration.

What happens when we change the max accel factor? This factor is actually set via the AccelerationNumerator and AccelerationDenominator options (because floats used to be more expensive than buying a house). At runtime, the Xlib function of your choice is XChangePointerControl(). That's what all the traditional config tools use (xset, your desktop environment pre-libinput, etc.).
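
A minimal sketch of that runtime path, roughly what xset does under the hood (error handling omitted):

#include <X11/Xlib.h>

/* Minimal sketch: set the acceleration numerator/denominator and the
 * threshold the way the traditional config tools do, roughly equivalent to
 * `xset m 4/2 4` for these values. */
int main(void)
{
    Display *dpy = XOpenDisplay(NULL);

    if (!dpy)
        return 1;

    /* factor 4/2 = 2, acceleration kicks in above a threshold of 4 */
    XChangePointerControl(dpy, True, True, 4, 2, 4);

    XCloseDisplay(dpy);
    return 0;
}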

First thing that sticks out: one is not like the others. When max acceleration is 0, the factor is always zero for speeds exceeding the threshold. No user impact though, the server discards factors of 0.0 and leaves the input delta as-is.

Otherwise it's relatively unexciting, it changes the maximum acceleration without changing the incline of the function. And it has no effect on deceleration. Because the curves aren't linear ones, they don't overlap 100% but meh, whatever. The higher values are cut off in this view, but they just look like a larger version of the visible 2 and 4 curves.

Next config option: ConstantDeceleration. This one is handled outside of the profile but the code is easy enough to follow; it's a basic multiplier applied together with the factor. (I cheated and just did this in gnuplot directly)

Easy to see what happens with the curve here, it simply stretches vertically without changing the properties of the curve itself. If the deceleration is greater than 1, we get constant acceleration instead.

All this means that with the default profile, we have 3 ways of adjusting it. What we can't directly change is the incline, i.e. the actual process of acceleration remains the same.

Velocity calculation

As mentioned above, the profile applies to a velocity so obviously we need to calculate that first. This is done by storing each delta and looking at their direction and individual velocity. As long as the direction remains roughly the same and the velocity between deltas doesn't change too much, the velocity is averaged across multiple deltas - up to 16 in the default config. Of course you can change whether this averaging applies, the max time deltas or velocity deltas, etc. I'm honestly not sure anyone ever used any of these options intentionally or with any real success.

Velocity scaling was explained above (units/ms * reporting rate). The default value for the reporting rate is 10, equivalent to 100Hz. Of the 155 frequencies currently defined in 70-mouse.hwdb, only one is 100 Hz. The most common one here is 125Hz, followed by 1000Hz followed by 166Hz and 142Hz. Now, the vast majority of devices don't have an entry in the hwdb file, so this data does not represent a significant sample set. But for modern mice, the default velocity scale of 10 is probably off by anywhere between 25% and a factor of 10. While this doesn't mean much for the local example (users generally just move the numbers around until they're happy enough) it means that the actual values are largely meaningless for anyone but those with the same hardware.

Of note: the synaptics driver automatically sets VelocityScale to 80Hz. This is correct for the vast majority of touchpads.

Epilogue

The graphs above show the X server's pointer acceleration for mice, trackballs and other devices and the effects of the configuration toggles. I purposely did not put any specific analysis in and/or comparison to libinput. That will come in a future post.

[1] I still have a branch somewhere where the server prints yaml to the log file which can then be extracted by shell scripts, passed on to python for processing and ++++ out of cheese error. redo from start ++++
May 09, 2018

Intro slide

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

Thanks

This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of OpenTechSummit, specifically @hpdang and @mariobehling for hosting a great event.

May 07, 2018

The Vulkan specification includes a number of optional features that drivers may or may not support, as described in chapter 30.1 Features. Application developers can query the driver for supported features via vkGetPhysicalDeviceFeatures() and then activate the subset they need in the pEnabledFeatures field of the VkDeviceCreateInfo structure passed at device creation time.

In the last few weeks I have been spending some time, together with my colleague Chema, adding support for one of these features in Anvil, the Intel Vulkan driver in Mesa, called shaderInt16, which we landed in Mesa master last week. This is an optional feature available since Vulkan 1.0. From the spec:

shaderInt16 specifies whether 16-bit integers (signed and unsigned) are supported in shader code. If this feature is not enabled, 16-bit integer types must not be used in shader code. This also specifies whether shader modules can declare the Int16 capability.

It is probably relevant to highlight that this Vulkan capability also requires the SPIR-V Int16 capability, which basically means that the driver’s SPIR-V compiler backend can actually digest SPIR-V shaders that declare and use 16-bit integers, and which is really the core of the functionality exposed by the Vulkan feature.
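As a quick illustration of the query-and-enable flow from the application side (error handling and the rest of the device setup omitted; the helper below is a sketch, not code from the driver or any particular application):

```c
/* Sketch: only request the optional shaderInt16 feature if the driver
 * exposes it. */
#include <vulkan/vulkan.h>

VkResult create_device_with_int16(VkPhysicalDevice phys_dev,
                                  VkDeviceCreateInfo *info, VkDevice *dev)
{
    VkPhysicalDeviceFeatures supported;
    VkPhysicalDeviceFeatures enabled = { 0 };

    vkGetPhysicalDeviceFeatures(phys_dev, &supported);

    /* enable only the subset we need and the driver supports */
    enabled.shaderInt16 = supported.shaderInt16;

    info->pEnabledFeatures = &enabled;
    return vkCreateDevice(phys_dev, info, NULL, dev);
}
```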

Ideally, shaderInt16 would increase the overall throughput of integer operations in shaders, leading to better performance when you don’t need a full 32-bit range. It may also provide better overall register usage since you need less register space to store your integer data during shader execution. It is important to remark, however, that not all hardware platforms (Intel or otherwise) may have native support for all possible types of 16-bit operations, and thus, some of them might still need to run in 32-bit (which requires injecting type conversion instructions in the shader code). For Intel platforms, this is the case for operations associated with integer division.

From the point of view of the driver, this is the first time that we generally exercise lower bit-size data types in the driver compiler backend, so if you find any bugs in the implementation, please file bug reports in bugzilla!

Speaking of shaderInt16, I think it is worth mentioning its interactions with other Vulkan functionality that we implemented in the past: the Vulkan 1.0 VK_KHR_16bit_storage extension (which has been promoted to core in Vulkan 1.1). From the spec:

The VK_KHR_16bit_storage extension allows use of 16-bit types in shader input and output interfaces, and push constant blocks. This extension introduces several new optional features which map to SPIR-V capabilities and allow access to 16-bit data in Block-decorated objects in the Uniform and the StorageBuffer storage classes, and objects in the PushConstant storage class. This extension allows 16-bit variables to be declared and used as user-defined shader inputs and outputs but does not change location assignment and component assignment rules.

While the shaderInt16 capability provides the means to operate with 16-bit integers inside a shader, the VK_KHR_16bit_storage extension provides developers with the means to also feed shaders with 16-bit integer (and also floating point) input data, such as Uniform/Storage Buffer Objects or Push Constants, from the applications side, plus, it also gives the opportunity for linked shader stages in a graphics pipeline to consume 16-bit shader inputs and produce 16-bit shader outputs.

VK_KHR_16bit_storage and shaderInt16 should be seen as two parts of a whole, each one addressing one part of a larger problem: VK_KHR_16bit_storage can help reduce memory bandwith for Uniform and Storage Buffer data accessed from the shaders and / or optimize Push Constant space, of which there are only a few bytes available, making it a precious shader resource, but without shaderInt16, shaders that are fed 16-bit input data are still required to convert this data to 32-bit internally for operation (and then back again to 16-bit for output if needed). Likewise, shaders that use shaderInt16 without VK_KHR_16bit_storage can only operate with 16-bit data that is generated inside the shader, which largely limits its usage. Both together, however, give you the complete functionality.
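For completeness, here is a hedged sketch of how an application can query these storage features alongside shaderInt16 through the Vulkan 1.1 pNext chain (on Vulkan 1.0 the equivalent KHR-suffixed structures and entry point are used):

```c
/* Illustrative query only; enabling works the same way by chaining the
 * filled-in structs into VkDeviceCreateInfo::pNext. */
#include <vulkan/vulkan.h>

void query_16bit_support(VkPhysicalDevice phys_dev)
{
    VkPhysicalDevice16BitStorageFeatures storage16 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES,
    };
    VkPhysicalDeviceFeatures2 features2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
        .pNext = &storage16,
    };

    vkGetPhysicalDeviceFeatures2(phys_dev, &features2);

    /* features2.features.shaderInt16      -> 16-bit arithmetic in shaders
     * storage16.storageBuffer16BitAccess, uniformAndStorageBuffer16BitAccess,
     * storagePushConstant16, storageInputOutput16
     *                                      -> 16-bit data in SSBOs/UBOs, push
     *                                         constants and shader interfaces */
}
```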

Conclusions

We are very happy to continue expanding the feature set supported in Anvil and we look forward to seeing application developers making good use of shaderInt16 in Vulkan to improve shader performance. As noted above, this is the first time that we fully enable the shader compiler backend to do general purpose operations on lower bit-size data types and there might be things that we can still improve or optimize. If you hit any issues with the implementation, please contact us and / or file bug reports so we can continue to improve the implementation.

Last week I fixed the last change from review feedback and landed the V3D (vc5/6 kernel DRM driver) in drm-misc-next. This means, unless something exceptional happens, it’ll be in kernel 4.18. I also sent out a patch series renaming the Mesa driver to V3D as well.

For VC4, I landed Stefan Schake’s syncobj patches in the kernel. Once that makes its way to drm-next, we can merge the Mesa side (EGL_ANDROID_native_fence_sync extension) and I believe get out-of-the-box Android HWC2 support.

I also merged a couple of old DSI patches of mine that got reviewed, my DPI regression fix, and a fix from Boris for a (valid) warning that was happening about the MADVISE refcounting versus async pageflips.

Boris has been busy working on adding display properties for TV underscan. Many HDMI monitors emulate old TV style scanout where, despite being 1920x1080 pixels, they read from a smaller area of their input and scale up. HDMI has infoframes to tell the TV whether the input is formatted for that old style scanout or not (as a desktop, we would always want them to scan out 1:1), but since the HDMI conformance tests didn’t check it, most old HDMI TVs don’t change behavior on that input and just always underscan (or you have to dig around in menus to get 1:1). The overscan property on the closed source firmware configures the HVS to scale the 1920x1080 desktop down to the region that the TV will scale back up. The display will look bad, but it’s better than losing your menu bars off the edge of the screen. Once we collectively decide what the new KMS property should be, Boris’s patches should let people replicate this workaround in the KMS world.

Boris has also been working on a patch series to get VC4 to work with the DSI display in the DT, whether or not the DSI display is actually connected. We need this if we want to avoid having additional closed source firmware code hacking up the DT based on its own probing of the DSI display. His new series seems to have a chance of getting past the DT, panel, and bridge maintainers, but I have honestly lost count of what iteration of probing behavior we’re on for this device (I think it’s around 10 variations, though). It was all working at one point, before DT review requirements wrecked it.

April 30, 2018

Last weekend I attended Ubucon Europe 2018 in the city center of Gijón, Spain, at the Antiguo Instituto. This 3-day conference was full of talks, sometimes 3 or even 4 at the same time! The covered topics ranged from Ubuntu LoCo teams to industrial hardware… there was at least one talk of interest for every attendee! Please check the schedule if you want to know what happened at Ubucon ;-)

Ubucon Europe 2018

I would like to say thank you to the whole organization team. They did a great job taking care of both the attendees and the speakers, and they organized a very nice traditional espicha that introduced good cider and Asturian gastronomy to people coming from all over the world. The conference was amazing and I am very pleased to have participated in this event. I recommend attending it next year… lots of fun and open-source!

Regarding my talk “Introduction to Mesa, an open-source graphics API”, there were 30 people attending it. It was great to see that so many people were interested in a brief introduction to how Mesa works, how the community collaborates and how anyone can become a contributor to Mesa, even as a non-developer. I hope I can see them contributing to Mesa in the near future!

You can find the slides of my talk here.

Ubucon Europe 2018 talk

For VC4, I landed Stefan Schake’s patches for CTM support (notably Android color correction) and per-plane alpha (so you can fade planes in and out without rewriting the alpha in all the pixels). He also submitted patches to the kernel and Mesa for syncobjects/fence fd support (needed for Android HWC2), which I gave some simplifying feedback on but which are otherwise almost ready to land.

For V3D, I incorporated Daniel Vetter’s feedback on the DRM code and resubmitted it. I’m optimistic about it merging in 4.18. I then spent a while trawling the HW bug reports looking for reasons for my remaining set of GPU hangs on 7278, generating a large set of Mesa patches (most of which were assertions for those GPU restrictions I learned about, and which we aren’t triggering yet). While I was doing that, I also went through some CTS and piglit failures, and:

  • fixed gallium’s refcounting of separate stencil buffers
  • fixed reloads of separate-stencil buffers
  • added v4.x MSAA support
  • added centroid varying support
  • fixed a bit of 2101010 unorm/snorm texture support in gallium
  • fixed a few EGL tests from the GLES CTS
April 24, 2018

Been some time now since my last update on what is happening in Fedora Workstation and with current plans to release Fedora Workstation 28 in early May I thought this could be a good time to write something. As usual this is just a small subset of what the team has been doing and I always end up feeling a bit bad for not talking about the avalanche of general fixes and improvements the team adds to each release.

Thunderbolt
Christian Kellner has done a tremendous job keeping everyone informed of his work making sure we have proper Thunderbolt support in Fedora Workstation 28. One important aspect for us of this improved Thunderbolt support is that a lot of docking stations coming out will be requiring it, and thus without this work being done you would not be able to use a wide range of docking stations. For a lot of screenshots and more details about how the Thunderbolt support is done I recommend reading this article on Christian’s blog.

3rd party applications
It has taken us quite some time to get here, as getting this feature right involved both a lot of internal discussion about the policies around it and plenty of implementation detail. But starting from Fedora Workstation 28 you will be able to find more 3rd party software listed in GNOME Software if you enable it. The way it will work is that, as part of the initial setup, you will be asked if you want to have 3rd party software show up in GNOME Software. If you are upgrading you will be asked inside GNOME Software if you want to enable 3rd party software. You can also disable 3rd party software after enabling it from the GNOME Software settings as seen below:

GNOME Software settings

In Fedora Workstation 27 we did have PyCharm available, but we have now added the NVidia driver and Steam to the list for Fedora Workstation 28.

We have also been working with Google to try to get Chrome included here and we are almost there: they merged the needed Appstream metadata some time ago, for instance, but the last step requires some tweaking of how Google generates their package repository (basically adding the appstream metadata to their yum repository). We don’t have a clear timeline for when that will happen, but as soon as it does, Chrome will also appear in GNOME Software if you have 3rd party software enabled.

As we speak all 3rd party packages are RPMs, but we expect that going forward we will be adding applications packaged as Flatpaks too.

Finally if you want to propose 3rd party applications for inclusion you can find some instructions for how to do it here.

Virtualbox guest
Another major feature that got some attention and that we worked on for this release was Hans de Goede’s work to ensure Fedora Workstation could run as a Virtualbox guest out of the box. We know there are many people who have their first experience with Linux running it under Virtualbox on Windows or MacOSX and we wanted to make that first experience as good as possible. Hans worked with the Virtualbox team to clean up their kernel drivers and agree on a stable ABI so that they could be merged into the kernel and maintained there from now on.

Firmware updates
The Spectre/Meltdown situation did hammer home to a lot of people the need to have firmware updates easily available and easy to update. We created the Linux Vendor Firmware service for Fedora Workstation users with that in mind and it was great to see the service paying off for many Linux users, not only on Fedora, but also on other distributions who started using the service we provided. I would like to call out to Dell who was a critical partner for the Linux Vendor Firmware effort from day 1 and thus their users got the most benefit from it when Spectre and Meltdown hit. Spectre and Meltdown also helped get a lot of other vendors off the fence or to accelerate their efforts to support LVFS, and Richard Hughes and Peter Jones have been working closely with a lot of new vendors during this cycle to get support for their hardware and devices into LVFS. In fact Peter even flew down to the offices of one of the biggest laptop vendors recently to help them resolve the last issues before their hardware will start showing up in the firmware service. Thanks to the work of Richard Hughes and Peter Jones you will see both a wider range of brands supported in the Linux Vendor Firmware Service in Fedora Workstation 28 and a wider range of device classes.

Server side GL Vendor Neutral Dispatch
This is a bit of a technical detail, but Adam Jackson and Lyude Paul have been working hard this cycle on getting what we call Server side GLVND ready for Fedora Workstation 28. Currently we are looking at enabling it either as a zero-day update or shortly afterwards. So what is Server Side GLVND, you say? Well it is basically the last missing piece we need to enable the use of the NVidia binary driver through XWayland. Currently the NVidia driver works with Wayland native OpenGL applications, but if you are trying to run an OpenGL application requiring X, we need this to support it. And to be clear, once we ship this in Fedora Workstation 28 it will also require a driver update from NVidia to use it, so us shipping it is just step 1 here. We do also expect there to be some need for further tuning once we get all the pieces released to get top notch performance. Of course over time we hope and expect all applications to become Wayland native, but this is a crucial transition technology for many of our users. Of course if you are using Intel or AMD graphics with the Mesa drivers things already work great and this change will not affect you in any way.

Flatpak
Flatpaks basically already work, but we have kept our focus this time around on fleshing out the story in terms of the so-called Portals. Portals are essentially how applications are meant to be able to interact with things outside of the container on your desktop. Jan Grulich has put in a lot of great effort making sure we get portal support for Qt and KDE applications, most recently by adding support for the screen capture portal on top of Pipewire. You can read more about that on Jan Grulich’s blog. He is now focusing on getting the printing portal working with Qt.

Wim Taymans has also kept going full steam ahead on PipeWire, which is critical for us to enable applications dealing with cameras and similar on your system to be containerized. More details on that in my previous blog entry talking specifically about Pipewire.

It is also worth noting that we are working with Canonical engineers to ensure Portals also works with Snappy as we want to ensure that developers have a single set of APIs to target in order to allow their applications to be sandboxed on Linux. Alexander Larsson has already reviewed quite a bit of code from the Snappy developers to that effect.

Performance work
Our engineers have spent significant time looking at various performance and memory improvements since the last release. The main credit for the recently talked about ‘memory leak’ goes to Georges Basile Stavracas Neto from Endless, but many from our engineering team helped with diagnosing that and also fixed many other smaller issues along the way. More details about the ‘memory leak’ fix in Georges’ blog.

We are not done here though and Alberto Ruiz is organizing a big performance focused hackfest in Cambridge, England in May. We hope to bring together many of our core engineers to work with other members of the community to look at possible improvements. The Raspberry Pi will be the main target, but of course most improvements we do to make GNOME Shell run better on a Raspberry Pi also means improvements for normal x86 systems too.

Laptop Battery life
In our efforts to make Linux even better on laptops, Hans de Goede spent a lot of time figuring out things we could do to make Fedora Workstation 28 have better battery life. How valuable these changes are will of course depend on your exact hardware, but I expect more or less everyone to have a bit better battery life on Fedora Workstation 28 and for some it could be a lot better battery life. You can read a bit more about these changes in Hans de Goede’s blog.

April 23, 2018

For VC5, I renamed the kernel driver to “v3d” and submitted it to the kernel. Daniel Vetter came back right away with a bunch of useful feedback, and next week I’m resolving that feedback and continuing to work on the GMP support.

On the vc4 front, I did the investigation of the HDL to determine that the OLED matrix applies before the gamma tables, so we can expose it in the DRM for Android’s color correction. Stefan was also interested in reworking his fencing patches to use syncobjs, so hopefully we can merge those and get DRM HWC support in mainline soon. I also pushed Gustavo’s patch for using the new core DRM infrastructure for async cursor updates. This doesn’t simplify our code much yet, but Boris has a series he’s working on that gets rid of a lot of custom vc4 display code by switching more code over to the new async support.

Unfortunately, the vc4 subsystem node removal patch from last week caused the DRM’s platform device to not be on the SOC’s bus. This caused bus address translations to be subtly wrong and broke caching (so eventually the GPU would hang). I’ve shelved the patches for now.

I also rebased my user QPU submission code for the Raspberry Pi folks. They keep expressing interest in it, but we’ll see if it goes anywhere this time around. Unfortunately I don’t see any way to expose this for general distributions: vc4 isn’t capable enough for OpenCL or GL compute shaders, and custom user QPU submissions would break the security model (just like GL shaders would have without my shader validator, and I think validating user QPU submissions would be even harder than GL shaders).

As part of preparing my last two talks at LCA on the kernel community, “Burning Down the Castle” and “Maintainers Don’t Scale”, I have looked into how the Kernel’s maintainer structure can be measured. One very interesting approach is looking at the pull request flows, for example done in the LWN article “How 4.4’s patches got to the mainline”. Note that in the linux kernel process, pull requests are only used to submit development from entire subsystems, not individual contributions. What I’m trying to work out here isn’t so much the overall patch flow, but focusing on how maintainers work, and how that’s different in different subsystems.

Methodology

In my presentations I claimed that the kernel community is suffering from too steep hierarchies. And worse, the people in power don’t bother to apply the same rules to themselves as anyone else, especially around purported quality enforcement tools like code reviews.

For our purposes a contributor is someone who submits a patch to a mailing list, but needs a maintainer to apply it for them, to get the patch merged. A maintainer on the other hand can directly apply a patch to a subsystem tree, and will then send pull requests up the maintainer hierarchy until the patch lands in Linus’ tree. This is relatively easy to measure accurately in git: If the recorded patch author and committer match, it’s a maintainer self-commit, if they don’t match it’s a contributor commit.

There’s a few annoying special cases to handle:

  • Some people use different email addresses or spellings, and sometimes MTAs, patchwork and other tools used in the patch flow chain mangle things further. This could be fixed up with the mail mapping database that LWN for example uses to generate its contributor statistics. Since most maintainers have reasonable setups it doesn’t seem to matter much, hence I decided not to bother.

  • There are subsystems not maintained in git, but in the quilt patch management system. Andrew Morton’s tree is the only one I’m aware of, and I hacked up my scripts to handle this case. After that I realized it doesn’t matter, since Andrew merged exceedingly few of his own patches himself, most have been fixups that landed through other trees.

Also note that this is a property of each commit - the same person can be both a maintainer and a contributor, depending upon how each of their patches gets merged.

The ratio of maintainer self-commits compared to overall commits then gives us a crude, but fairly useful, metric for how steeply the kernel community overall is organized.
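A back-of-the-envelope version of that measurement could look like the sketch below; these are not the actual scripts used for the graphs in this article, and the v4.15..v4.16 range and the popen approach are just for illustration.

```c
/* Hypothetical sketch: count maintainer self-commits (git author email ==
 * committer email) for one release range. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* %ae = author email, %ce = committer email */
    FILE *p = popen("git log --no-merges --pretty=format:'%ae|%ce' "
                    "v4.15..v4.16", "r");
    char line[512];
    long total = 0, self = 0;

    if (!p)
        return 1;

    while (fgets(line, sizeof(line), p)) {
        char *sep = strchr(line, '|');
        if (!sep)
            continue;

        *sep = '\0';
        sep[strcspn(sep + 1, "\n") + 1] = '\0';  /* strip trailing newline */

        total++;
        if (strcmp(line, sep + 1) == 0)          /* author == committer */
            self++;
    }
    pclose(p);

    printf("maintainer self-commits: %ld of %ld (%.1f%%)\n",
           self, total, total ? 100.0 * self / total : 0.0);
    return 0;
}
```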

Measuring review is much harder. For contributor commits review is not recorded consistently. Many maintainers forgo adding an explicit Reviewed-by tag since they’re adding their own Signed-off-by tag anyway. And since that’s required for all contributor commits, it’s impossible to tell whether a patch has seen formal review before merging. A reasonable assumption though is that maintainers actually look at stuff before applying. For a minimal definition of review, “a second person looked at the patch before merging and deemed the patch a good idea” we can assume that merged contributor patches have a review ratio of 100%. Whether that’s a full formal review or not can unfortunately not be measured with the available data.

A different story is maintainer self-commits - if there is no tag indicating review by someone else, then either it didn’t happen, or the maintainer felt it’s not important enough work to justify the minimal effort to record it. Either way, a patch where the git author and committer match, and which sports no review tags in the commit message, strongly suggests it has indeed seen none.

An objection would be that these patches get reviewed by the next maintainer up, when the pull request gets merged. But there’s well over a thousand such patches each kernel release, and most of the pull requests containing them go directly to Linus in the 2 week long merge window, when the over 10k feature patches of each kernel release land in the mainline branch. It is unrealistic to assume that Linus carefully reviews hundreds of patches himself in just those 2 weeks, while getting hammered by pull requests all around. Similar considerations apply at a subsystem level.

For counting reviews I looked at anything that indicates some kind of patch review, even very informal ones, to stay consistent with the implied oversight the maintainer’s Signed-off-by line provides for merged contributor patches. I therefore included both Reviewed-by and Acked-by tags, including a plethora of misspelled and combined versions of the same.

The scripts also keep track of how pull requests percolate up the hierarchy, which allows filtering on a per-subsystem level. Commits in topic branches are accounted to the subsystem that first lands in Linus’ tree. That’s fairly arbitrary, but simplest to implement.

Last few years of GPU subsystem history

Since I’ve pitched the GPU subsystem against the kernel at large in my recent talks, let’s first look at what things look like in graphics:

GPU maintainer commit statistics Fig. 1 GPU total commits, maintainer self-commits and reviewed maintainer self-commits
GPU relative maintainer commit statistics Fig. 2 GPU percentage maintainer self-commits and reviewed maintainer self-commits

In absolute numbers it’s clear that graphics has grown tremendously over the past few years. Much faster than the kernel at large. Depending upon the metric you pick, the GPU subsystem has grown from being 3% of the kernel to about 10% and now trades spots for 2nd largest subsystem with arm-soc and staging (depending on who’s got a big pull for that release).

Maintainer commits keep up with GPU subsystem growth

The relative numbers have a different story. First, commit rights and the fairly big roll out of group maintainership we’ve done in the past 2 years aren’t extreme by historical graphics subsystem standards. We’ve always had around 30-40% maintainer self-commits. There’s a bit of a downward trend in the years leading towards v4.4, due to the massive growth of the i915 driver, and our failure to add more maintainers and committers for a few releases. Adding lots more committers and creating bigger maintainer groups from v4.5 on forward, first for the i915 driver, then to cope with the influx of new small drivers, brought us back to the historical trend line.

There’s another dip happening in the last few kernels, due to AMD bringing in a big new team of contributors to upstream. v4.15 was even more pronounced, in that release the entirely rewritten DC display driver for AMD GPUs landed. The AMD team is already using a committer model for their staging and internal trees, but not (yet) committing directly to their upstream branch. There’s a few process holdups, mostly around the CI flow, that need to be fixed first. As soon as that’s done I expect this recent dip will again be over.

In short, even when facing big growth like the GPU subsystem has, it’s very much doable to keep training new maintainers to keep up with the increased demand.

Review of maintainer self-commits established in the GPU subsystem

Looking at relative changes in how consistently maintainer self-commits are reviewed, there’s a clear growth from mostly no review to 80+% of all maintainer self-commits having seen some formal oversight. We didn’t just keep up with the growth, but scaled faster and managed to make review a standard practice. Most of the drivers, and all the core code, are now consistently reviewed. Even for tiny drivers with small to single person teams we’ve managed to pull this off, through combining them into larger teams run with a group maintainership model.

Last few years of kernel w/o GPU history

kernel w/o GPU maintainer commit statistics Fig. 3 kernel w/o GPU maintainer self-commits and reviewed maintainer self-commits
kernel w/o GPU relative maintainer commit statistics Fig. 4 kernel w/o GPU percentage maintainer self-commits and reviewed maintainer self-commits

Kernel w/o graphics is an entirely different story. Overall, review is much less a thing that happens, with only about 30% of all maintainer self-commits having any indication of oversight. The low ratio of maintainer self-commits is why I removed the total commit number from the absolute graph - it would have dwarfed the much more interesting data on self-commits and reviewed self-commits. The positive thing is that there’s at least a consistent, if very small upward trend in maintainer self-commit reviews, both in absolute and relative numbers. But it’s very slow, and will likely take decades until there’s no longer a double standard on review between contributors and maintainers.

Maintainers are not keeping up with the kernel growth overall

Much more worrying is the trend on maintainer self-commits. Both in absolute, and much more in relative numbers, there’s a clear downward trend, going from around 25% to below 15%. This indicates that the kernel community fails to mentor and train new maintainers at a pace sufficient to keep up with growth. Current maintainers are ever more overloaded, leaving ever less time for them to write patches of their own and get them merged.

Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged and not contributing anything of their own. The kernel community imploding under its own bureaucratic weight being the likely outcome of that.

This is a huge contrast to the “everything is getting better, bigger, and the kernel community is very healthy” fanfare touted at keynotes and the yearly kernel report. In my opinion, the kernel community is very much not coping with its growth well, nor does it look like an overall healthy community. Even when ignoring all the issues around conduct that I’ve raised.

It is also a huge contrast to what we’ve experienced in the GPU subsystem since aggressively rolling out group maintainership starting with the v4.5 release; by spreading the bureaucratic side of applying patches over many more people, maintainers have much more time to create their own patches and get them merged. More crucially, experienced maintainers can focus their limited review bandwidth on the big architectural design questions since they won’t get bogged down in the minutiae of every single simple patch.

4.16 by subsystem

Let’s zoom into how this all looks at a subsystem level, looking at just the recently released 4.16 kernel.

Most subsystems have unsustainable maintainer ratios

Trying to come up with a reasonable list of subsystems that have high maintainer commit ratios is tricky; some rather substantial pull requests are essentially just maintainers submitting their own work, giving them an easy 100% score. But of course that’s just an outlier in the larger scope of the kernel overall having a maintainer self-commit ratio of just 15%. To get a more interesting list of subsystems we need to look at only those with a group of regular contributors and more than just 1 maintainer. A fairly arbitrary cut-off of 200 commits or more in total seems to get us there, yielding the following top ten list:

subsystem            total commits   maintainer self-commits   maintainer ratio
GPU                  1683            614                       36%
KVM                  257             91                        35%
arm-soc              885             259                       29%
linux-media          422             111                       26%
tip (x86, core, …)   792             125                       16%
linux-pm             201             31                        15%
staging              650             61                        9%
linux-block          249             20                        8%
sound                351             26                        7%
powerpc              235             16                        7%

In short, there are very few places where it’s easier to become a maintainer than suggested by the already rather low roughly 15% the kernel scores overall. Outside of these few subsystems, the only realistic way is to create a new subsystem, somehow get it merged, and become its maintainer. In most subsystems being a maintainer is an elite status, and the historical trends suggest it will only become more so. If this trend isn’t reversed, then maintainer overload will get a lot worse in the coming years.

Of course subsystem maintainers are expected to spend more time reviewing and managing other people’s contribution. When looking at individual maintainers it would be natural to expect a slow decline in their own contributions in patch form, and hence a decline in self-commits. But below them a new set of maintainers should grow and receive mentoring, and those more junior maintainers would focus more on their own work. That sustainable maintainer pipeline seems to not be present in many kernel subsystems, drawing a bleak future for them.

Much more interesting are the review statistics, split up by subsystem. Again we need a cut-off for noise and outliers. The big outliers here are all the pull requests and trees that have seen zero review, not even any Acked-by tags. As long as we only look at positive examples we don’t need to worry about those. A rather low cut-off of at least 10 maintainer self-commits takes care of other random noise:

subsystem      total commits   maintainer self-commits   maintainer review ratio
f2fs           72              12                        100%
XFS            105             78                        100%
arm64          166             23                        91%
GPU            1683            614                       83%
linux-mtd      99              12                        75%
KVM            257             91                        74%
linux-pm       201             31                        71%
pci            145             37                        65%
remoteproc     19              14                        64%
clk            139             14                        64%
dma-mapping    63              60                        60%

Yes, XFS and f2fs have their shit together. More interesting is how wide the spread in the filesystem code is; there’s a bunch of substantial fs pulls with a review ratio of flat out zero. Not even a single Acked-by. XFS on the other hand insists on full formal review of everything - I spot checked the history a bit. f2fs is a bit of an outlier with 4.16, barely getting above the cut-off. Usually it has fewer patches and would have been excluded.

Everyone not in the top ten taken together has a review ratio of 27%.

Review double standards in many big subsystems

Looking at the big subsystems with multiple maintainers and huge groups of contributors - I picked 500 patches as the cut-off - there’s some really low review ratios: Staging has 7%, networking 9% and tip scores 10%. Only arm-soc is close to the top ten, with 50%, at the 14th position.

Staging having no standard is kinda the point, but the other core subsystems eschewing review is rather worrisome. More than 9 out of 10 maintainer self-commits merged into these core subsystems do not carry any indication that anyone else ever looked at the patch and deemed it a good idea. The only other subsystem with more than 500 commits is the GPU subsystem, at 4th position with an 83% review ratio.

Compared to maintainers overall the review situation is looking a lot less bleak. There’s a sizeable group of subsystems who at least try to make this work, by having similar review criteria for maintainer self-commits as for normal contributors. This is also supported by the rather slow, but steady, overall increase of reviews when looking at the historical trend.

But there’s clearly other subsystems where review only seems to be a gauntlet inflicted on normal contributors, entirely optional for maintainers themselves. Contributors cannot avoid review, because they can’t commit their own patches. When maintainers outright ignore review for most of their patches this creates a clear double standard between maintainers and mere contributors.

One year ago I wrote “Review, not Rocket Science” on how to roll out review in your subsystem. Looking at this data here I can close with an even shorter version:

What would Dave Chinner do?

Thanks a lot to Daniel Stone, Dave Chinner, Eric Anholt, Geoffrey Huntley, Luce Carter and Sean Paul for reading and commenting on drafts of this article.

April 22, 2018

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

Thanks

This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of FossNorth, specifically @e8johan for hosting a great event.


Mmm, a Moving Mesa Midgard Cube

In the last Panfrost status update, a transitory “half-way” driver was presented, with the purpose of easing the transition from a standalone library abstracting the hardware to a full-fledged OpenGL ES driver using the Mesa and Gallium3D infrastructure.

Since then, I’ve completed the transition, creating such a driver, but retaining support for out-of-tree testing.

Almost everything that was exposed with the custom half-way interface is now available through Gallium3D. Attributes, varyings, and uniforms all work. A bit of rasterisation state is supported. Multiframe programs work, as do programs with multiple non-indexed, direct draws per frame.

The result? The GLES test-cube demo from freedreno runs using the Mali T760 GPU present in my RK3288 laptop, going through the Mesa/Gallium3D stack. Of course, there’s no need to rely on the vendor’s proprietary compilers for shaders – the demo is using shaders from the free, NIR-based Midgard compiler.

Look ma, no blobs!


In the past three weeks since the previous update, all aspects of the project have seen fervent progress, culminating in the above demo. The change list for the core Gallium driver is lengthy but largely routine: abstracting hardware features that were already understood and integrating them with Gallium, resolving bugs discovered in the process, and repeating until the next GLES test passes normally. Enthusiastic readers can read the code of the driver core on GitLab.

Although numerous bugs were solved in this process, one in particular is worthy of mention: the “tile flicker bug”, notorious to lurkers of our Freenode IRC channel, #panfrost. Present since the first render, this bug resulted in non-deterministic rendering glitches, where particular tiles would display the background colour in lieu of the render itself. The non-deterministic nature had long suggested it was either the result of improper memory management or a race condition, but the precise cause was unknown. Finally, the cause was narrowed down to a race condition between the vertex/tiler jobs responsible for draws, and the fragment job responsible for screen painting. With this cause in mind, a simple fix squashed the bug, hopefully for good; renders are now deterministic and correct. Huge thanks to Rob Clark for letting me use him as a sounding board to solve this.

In terms of decoding the command stream, some miscellaneous GL state has been determined, like some details about tiler memory management, texture descriptors, and shader linkage (attribute and varying metadata). By far, however, the most significant discovery was the operation of blending on Midgard. It’s… well, unique. If I had known how nuanced the encoding was – and how much code it takes to generate from Gallium blend state – I would have postponed decoding like originally planned.

In any event, blending is now understood. Under Midgard, there are two paths in the hardware for blending: the fixed-function fast path, and the programmable slow path, using “blend shaders”. This distinction has been discussed sparsely in Mali documentation, but the conditions for the fast path were not known until now. Without further ado, the fixed-function blending hardware works when:

  • The blend equation is either ADD, SUBTRACT, or REVERSE_SUBTRACT (but not MIN or MAX)
  • The “dominant” blend function is either the source/destination colour/alpha, or the special case of a constant ONE or ZERO (but not a constant colour or anything fancier), or the additive complement thereof.
  • The non-dominant blend function is either identical to the dominant blend function, or one of the constant special cases.

If these conditions are not met, a blend shader is used instead, incurring a presently unknown performance hit.
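To make the shape of that test concrete, here is a rough outline using placeholder enums rather than the driver’s real Gallium state; in particular the dominant/non-dominant classification is glossed over, so treat it as a sketch rather than the actual check.

```c
/* Simplified outline of the fixed-function vs. blend-shader decision.
 * Placeholder enums, not the driver's real state structs; the real code
 * also has to pick the "dominant" factor and check the non-dominant one
 * against it, which is omitted here. */
#include <stdbool.h>

enum blend_equation { EQ_ADD, EQ_SUBTRACT, EQ_REVERSE_SUBTRACT, EQ_MIN, EQ_MAX };

enum blend_factor {
    F_ZERO, F_ONE,
    F_SRC_COLOR, F_ONE_MINUS_SRC_COLOR,
    F_SRC_ALPHA, F_ONE_MINUS_SRC_ALPHA,
    F_DST_COLOR, F_ONE_MINUS_DST_COLOR,
    F_DST_ALPHA, F_ONE_MINUS_DST_ALPHA,
    F_CONSTANT_COLOR, F_ONE_MINUS_CONSTANT_COLOR,
    F_CONSTANT_ALPHA, F_ONE_MINUS_CONSTANT_ALPHA,
};

static bool factor_ok_for_fixed_function(enum blend_factor f)
{
    /* source/destination colour/alpha, ONE/ZERO, or their complements are
     * fine; arbitrary constant colours and anything fancier are not */
    switch (f) {
    case F_CONSTANT_COLOR:
    case F_ONE_MINUS_CONSTANT_COLOR:
    case F_CONSTANT_ALPHA:
    case F_ONE_MINUS_CONSTANT_ALPHA:
        return false;
    default:
        return true;
    }
}

bool can_use_fixed_function_blend(enum blend_equation eq,
                                  enum blend_factor src,
                                  enum blend_factor dst)
{
    if (eq == EQ_MIN || eq == EQ_MAX)
        return false;   /* MIN/MAX always take the blend shader path */

    return factor_ok_for_fixed_function(src) &&
           factor_ok_for_fixed_function(dst);
}
```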

By dominant and non-dominant modes, I’m essentially referring to the more complex and less complex blend functions respectively, comparing between the functions for the source and the destination. The exact details of the encoding are a little hairy beyond the scope of this post but are included in the corresponding Panfrost headers and the corresponding code in the driver.

In any event, this separation between fixed-function and programmable blending is now more or less understood. Additionally, blend shaders themselves are now intelligible with Connor Abbott’s Midgard disassembler; blend shaders are just normal Midgard shaders, with an identical ISA to vertex and fragment shaders, and will eventually be generated with the existing NIR compiler. With luck, we should be able to reuse code from the NIR compiler for the vc4, an embedded GPU lacking fixed-function hardware for any blending whatsoever. Additionally, blend shaders open up some interesting possibilities; we may be able to enable developers to write blend shaders themselves in GLSL through a vendored GL extension. More practically, blend shaders should enable implementation of all blend modes, as this is ES 3.2 class hardware, as well as presumably logic operations.

Command-stream work aside, the Midgard compiler also saw some miscellaneous improvements. In particular, the mystery surrounding varyings in vertex shaders has finally been cracked. Recall that gl_Position stores are accomplished by writing the screen-space coordinate to the special register r27, and then including a st_vary instruction with the mysterious input register r1 to the appropriate address. At the time, I had (erroneously) assumed that the r27 store was responsible for the write, and the subsequent instruction was a peculiar errata workaround.

New findings show it is quite the opposite: it is the store instruction that does the store, but it uses the value of r27, not r1, for its input. What does the r1 signify, then? It turns out that two different registers can be used for varying writes, r26 and r27. The register in the store instruction selects between these registers: a value of zero uses r26 whereas a value of one uses r27. Why, then, are there two varying source registers? Midgard is a VLIW architecture, in this case meaning that it can execute two store instructions simultaneously for improved performance. To achieve this parallelism, it needs two source registers, to be able to write two different values to the two varyings.

This new understanding clarifies some previous peculiar disassemblies, as the purpose of writes to r26 are now understood. This discovery would have been easier had r26 not also represented a reference to an embedded constant!

More importantly, it enables us to implement varying stores in the vertex shader, allowing for smoothed demos, like the shading on test-cube, to work. As a bonus, it cleans up the code relating to gl_Position writes, as we now know they can use the same compiler code path as writes to normal varyings.

Besides varyings, the Midgard compiler also saw various improvements, notably including a basic register allocator, crucial for compiling even slightly nontrivial shaders, such as that of the cube.


Beyond Midgard, my personal focus, Bifrost has continued to see sustained progress. Connor Abbott has continued decoding the new shader ISA, uncovering and adding disassembler support for a few miscellaneous new instructions and in particular branching. Branching under Bifrost is somewhat involved – the relevant disassembler commit added over two hundred lines of code – with semantics differing noticeably from Midgard. He has also begun work porting the panwrap infrastructure for capturing, decoding, and replaying command streams from Midgard to Bifrost, to pave the way for a full port of the driver to Bifrost down the line.

While Connor continues work on his disassembler, Lyude Paul has been working on a Bifrost assembler compatible with the disassembler’s output, a milestone necessary to demonstrate understanding of the instruction set and a useful prerequisite to writing a Bifrost compiler.


Going forward, I plan on cleaning up technical debt accumulated in the driver to improve maintainability, flexibility, and perhaps performance. Additionally, it is perhaps finally time to address the elephant in the command stream room: textures. Prior to this post, there were two major bugs in the driver: the missing tile bug and the texture reading bug. Seeing as the former was finally solved with a bit of persistence, there’s hope for the latter as well.

May the pans frost on.

April 21, 2018

In February after Plasma 5.12 was released we held a meeting on how we want to improve Wayland support in Plasma 5.13. Since its beta is now less than one month away it is time for a status report on what has been achieved and what we still plan to work on.

Also, today started a week-long Plasma Sprint in Berlin, which will hopefully accelerate the Wayland work for 5.13. So, in order to kick-start the sprint, this is a good opportunity to sum up where we stand now.

QT_QPA_PLATFORM

Let us start with a small change, but with huge implications: the decision to not set the environment variable QT_QPA_PLATFORM to wayland anymore in Plasma’s startup script.

Qt based applications use this environment variable to determine the platform plugin they should load. The environment variable was set to wayland in Plasma’s Wayland session in order to tell Qt based applications that they should act like Wayland native clients. Otherwise they load the default plugin, which is xcb and means that they try to be X clients in a Wayland session.

This also works, thanks to Xwayland, but of course in a Wayland session we want as many applications to be Wayland native clients as possible. That was probably the rationale behind setting the environment variable in the first place. The problem is though, that this is not always possible. While KDE applications are compiled with the Qt Wayland platform plugin, some third-party Qt applications were not. A prominent example is the Telegram desktop client, which would just give up on launch in a Wayland session because of that.

With the change this is no longer a problem. No longer forced via its QT_QPA_PLATFORM environment variable onto an unavailable plugin, the Telegram binary will just execute using the xcb plugin and therefore run as an Xwayland client in our Wayland session.

One drawback is that this now applies to all Qt based applications. While the Plasma processes were adjusted to select the Wayland plugin themselves based on session information, other applications might not do so, even though the wayland plugin might be available, and will then still run as Xwayland clients. But this problem might go away with Qt 5.11, which is supposed to either change the behavior of QT_QPA_PLATFORM itself or feature a new environment variable such that an application can express preferences for plugins and fall back to the first one supported by the session.

Martin Flöser, who wrote most of the patches for this change, talked about it and the consequences in his blog as well.

Screencasts

A huge topic on Desktop Wayland was screen recording and sharing. In the past application developers had a single point of entry to write for in order to receive screencasts: the XServer. In Wayland the compositor as Wayland server has replaced the XServer and so an application would need to talk to the compositor if it wants access to screen content.

This rightfully raised the fear that developers of screencast apps would now need to write a different backend for every Wayland compositor in order to receive video data. As a spoiler: luckily this won’t be necessary.

So how did we achieve this? First of all support for screencasts had to be added to KWin and KWayland. This was done by Oleg Chernovskiy. While this is still a KWayland specific interface the trick was now to proxy via xdg-desktop-portal and PipeWire. Jan Grulich jumped in and implemented the necessary backend code on the xdg-desktop-portal side.

A screencast app therefore in the future only needs to talk with xdg-desktop-portal and receive video data through PipeWire on Plasma Wayland. Other compositors then will have to add a similar backend to xdg-desktop-portal as it was done by Jan, but the screencast app stays the same.

Configure your mouse

I wrote a system settings module (KCM) for touchpad configuration on Wayland last year. The touchpad KCM had higher priority than the Mouse KCM back then because there was no way to configure anything about a touchpad on Wayland, while there was a small hack in KWin to at least control the mouse speed.

Still this was no long term solution in regards to the Mouse KCM, and so I wrote a libinput based Wayland Mouse KCM similar to the one I wrote for touchpads.

Wayland Mouse KCM

I went one step further and made the Mouse KCM interact with Libinput on X as well. There was some work done on this in the Mouse KCM in the past, but now it features a fitting UI like on Wayland and uses the same backend abstraction.

Dmabuf-based Wayland buffers

Fredrik Höglund uploaded patches for review to add support for dmabuf-based Wayland buffer sharing. This is a somewhat technical topic and will not directly influence the user experience in 5.13. But it is to see in the context of bigger changes upstream in Wayland, X and Mesa. The keyword here is buffer modifiers. You can read more about them in this article by Daniel Stone.

Per output color correction

Adjusting the colors and overall gamma of displays individually is a feature, which is quite important to some people and is provided in a Plasma X session via KGamma in a somewhat simplistic fashion.

Since I wrote Night Color as a replacement for Redshift in our Wayland session not long ago I was already somewhat involved in the color correction game.

But this game is becoming increasingly more complex: my current solution for per output color correction includes changes to KWayland, KWin, libkscreen, libcolorcorrect and adds a KCM replacing KGamma on Wayland to let the user control it.

Additionally, there are different opinions on how this should work in general, and some explanations by upstream confused me more than they guided me to the one best solution. I will most likely ignore these opinions for the moment and concentrate on the one solution I have at the moment, which might already be sufficient for most people. I believe it will actually be quite nice to use; for example I plan to provide a color curve widget borrowed from Krita to set the color curves via some control points and curve interpolation.

More on 5.13 and beyond

In the context of per output color correction another topic, which I am working on right now, is abstracting our output classes in KWin’s Drm and Virtual backends to the compositing level. This will first enable my color correction code to be nicely integrated, and I anticipate it will in the long term even be necessary for two other, far more important topics: layered rendering and compositing per output, which will improve performance and allow different refresh rates on multi-monitor setups. But these two tasks will need much more time.

Scaling on Wayland can be done per output, and while I am no expert on this topic, from what I heard scaling should, because of that and for other reasons, work much better on Wayland than on X. But there is currently one huge drawback in our Wayland session: we can only scale by integer factors. To change this David Edmundson has posted patches for review adding support for xdg-output to KWayland and to KWin. This is one step towards allowing fractional scaling on Wayland. There is more to do according to David, and since he takes part in the sprint I hope we can talk about scaling on Wayland extensively in order for me to better understand the current mechanism and what all needs to be changed in order to provide fractional scaling.

At last there is cursor locking, which is in theory supported by KWin, but in practice does not work well in the games I tried it with. I hope to start work on this topic before 5.13, but I will most likely not finish it for 5.13.

So overall there is lots of progress, but still quite some work to do. In this regard I am certain the Plasma Sprint this week will be fruitful. We can discuss problems, exchange knowledge and simply code in unity (no pun intended). If you have questions or feedback that you want us to address at this sprint, feel free to comment on this article.

April 17, 2018

For some time now I have been working on a personal project to render the well known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:


Sponza rendering

This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).

The following list includes the main features implemented in the demo:

  • Depth pre-pass
  • Forward and deferred rendering paths
  • Anisotropic filtering
  • Shadow mapping with Percentage-Closer Filtering
  • Bump mapping
  • Screen Space Ambient Occlusion (only on the deferred path)
  • Screen Space Reflections (only on the deferred path)
  • Tone mapping
  • Anti-aliasing (FXAA)

I have been thinking about writing a post about this for some time, but given that there are multiple features involved I wasn’t sure how to scope it. Eventually I decided to write a “frame analysis” post where I describe, step by step, all the render passes involved in the production of the single frame capture shown at the top of the post. I always enjoyed reading this kind of article so I figured it would be fun to write one myself, and I hope others find it informative, if not entertaining.

To avoid making the post too dense I won’t go into too much detail while describing each render pass, so don’t expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlusion for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specific posts (I don’t make any promises though!).

If you’re interested in going through with this then grab a cup of coffe and get ready, it is going to be a long ride!

Step 0: Culling

This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn’t affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.

In large, complex scenes with tons of objects we probably want to use more sophisticated culling methods such as Quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go through all of them and cull them individually against the camera’s frustum, which determines the area of the 3D space that is visible to the camera.

The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera’s frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.

The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but it should not affect a lot of meshes and trying to improve this would incur additional checks that could undermine the efficiency of the process anyway.
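
To make that test a bit more concrete, here is a minimal sketch in C of the AABB-versus-frustum check described above. It is not the demo’s actual code; it assumes the six frustum planes are stored as dot(n, p) + d = 0 with normals pointing towards the inside of the frustum.

#include <stdbool.h>

typedef struct { float x, y, z; } vec3;
typedef struct { vec3 n; float d; } plane;   /* dot(n, p) + d = 0, normal points inwards */
typedef struct { vec3 min, max; } aabb;

bool aabb_outside_frustum(const aabb *box, const plane frustum[6])
{
   for (int i = 0; i < 6; i++) {
      /* Pick the box corner furthest along the plane normal (the
       * "positive vertex"). If even that corner is behind the plane,
       * the whole box is outside the frustum and the mesh can be culled. */
      vec3 p = {
         frustum[i].n.x >= 0.0f ? box->max.x : box->min.x,
         frustum[i].n.y >= 0.0f ? box->max.y : box->min.y,
         frustum[i].n.z >= 0.0f ? box->max.z : box->min.z,
      };
      float dist = frustum[i].n.x * p.x +
                   frustum[i].n.y * p.y +
                   frustum[i].n.z * p.z + frustum[i].d;
      if (dist < 0.0f)
         return true;    /* fully outside one plane: flag the mesh as culled */
   }
   return false;          /* inside or intersecting: keep it */
}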

Since in this particular demo we only have static geometry we only need to run the culling pass when the camera moves around, since otherwise the list of visible meshes doesn’t change. If dynamic geometry were present, we would need to at least cull dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step in (or out of) the viewing frustum at any moment.

Step 1: Depth pre-pass

This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, and especially so as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up on the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.

One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won’t eliminate all the instances of the problem in the general case.

Also, sometimes things are more complicated, as the shading cost of different pieces of geometry can be very different and we should also take this into account. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while some others are very far away and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.

Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don’t even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don’t hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.

Depth pre-pass output

The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That’s why the picture gets brighter as we move further away.

Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible on the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth-prepass.
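
In Vulkan terms the difference between the two kinds of passes boils down to a small change in the pipeline’s depth/stencil state. A hedged sketch of what that could look like (illustrative values, not copied from the demo’s code):

#include <vulkan/vulkan.h>

/* Depth pre-pass pipeline: standard LESS test with depth writes enabled. */
VkPipelineDepthStencilStateCreateInfo prepass_ds = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO,
   .depthTestEnable = VK_TRUE,
   .depthWriteEnable = VK_TRUE,
   .depthCompareOp = VK_COMPARE_OP_LESS,
};

/* Later passes: test with EQUAL against the pre-pass results and do not
 * write depth again, so only exactly the visible fragments get shaded. */
VkPipelineDepthStencilStateCreateInfo shading_ds = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO,
   .depthTestEnable = VK_TRUE,
   .depthWriteEnable = VK_FALSE,
   .depthCompareOp = VK_COMPARE_OP_EQUAL,
};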

In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance.

Step 2: Shadow map

In this demo we have a single light source: a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction of the projected shadows.

I already covered how shadow mapping works in a previous series of posts, so if you’re interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).

With that information, we will be able to inform our lighting pass so it can tell if a particular fragment is in the shadows (not visible from our light’s perspective) or in the light (visible from our light’s perspective) and shade it accordingly.

From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera’s and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.
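
For reference, the lighting pass will later compare against this depth texture with a test along these lines. This is an illustrative sketch, not the demo’s code: shadow_map_at() is an assumed helper, and bias is a common trick to avoid self-shadowing artifacts.

/* Samples the shadow map at the fragment's position in the light's
 * coordinate space (assumed helper). */
extern float shadow_map_at(float u, float v);

int fragment_in_shadow(float light_u, float light_v, float light_depth, float bias)
{
   /* If the shadow map recorded something closer to the light than this
    * fragment, other geometry blocks the light: the fragment is in shadow. */
   return shadow_map_at(light_u, light_v) < light_depth - bias;
}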

Shadow map

In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends up with the maximum depth, which is what we use to clear the shadow map before rendering to it.

In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):

Sponza rendering with 1024×1024 shadow map

You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.

Step 3: GBuffer

In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).

What we do here is to render our geometry normally, like we did in our depth-prepass, but this time, as we explained before, we configure the depth test to only pass fragments that match the contents of the depth buffer that we produced in the depth-prepass, so we only process fragments that we know will be visible on the screen.

Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:

  1. Normal vector
  2. Diffuse color
  3. Specular color
  4. Position of the fragment from the point of view of the light (for shadow mapping)

It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don’t have dedicated video memory (such as my Intel GPU).

In the demo, I do lighting in view-space (that is, the coordinate space takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments on the screen. To avoid storing the latter in the GBuffer we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth-prepass) and the camera’s projection matrix. I might cover the process in more detail in another post; for now, what is important to remember is that we don’t need to worry about storing fragment positions in the GBuffer and that saves us some bandwidth, helping performance.
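
Here is a minimal sketch of that reconstruction, assuming we have the fragment’s texture coordinates, its depth-buffer value and the inverse of the camera’s projection matrix (stored column-major, like GLSL matrices); the demo’s shader code may differ in the details:

typedef struct { float x, y, z, w; } vec4;

/* Multiply a column-major 4x4 matrix by a vector. */
static vec4 mat4_mul_vec4(const float m[16], vec4 v)
{
   vec4 r = {
      m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12]*v.w,
      m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13]*v.w,
      m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w,
      m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w,
   };
   return r;
}

vec4 view_pos_from_depth(float u, float v, float depth, const float inv_proj[16])
{
   /* Texture coordinates ([0,1]) to normalized device coordinates. */
   vec4 ndc = { 2.0f * u - 1.0f, 2.0f * v - 1.0f, depth, 1.0f };
   /* Unproject and undo the perspective divide. */
   vec4 p = mat4_mul_vec4(inv_proj, ndc);
   p.x /= p.w; p.y /= p.w; p.z /= p.w; p.w = 1.0f;
   return p;
}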

Let’s have a look at the various GBuffer textures we produce in this stage:

Normal vectors

GBuffer normal texture

Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera’s view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera’s view direction are blue (positive Z).

It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.

For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and you will immediately notice the additional detail added by bump mapping to the surface descriptions:

GBuffer normal texture (bump mapping disabled)

To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:

Bump mapping enabled
Bump mapping disabled

All the extra detail in the columns is the sole result of the bump mapping technique.

Diffuse color

GBuffer diffuse texture

Here we have the diffuse color of each fragment in the scene. This is basically how our scene would look if we didn’t implement a lighting pass that considers how the light source interacts with the scene.

Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.

Specular color

GBuffer specular texture

This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.

Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.

For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as is to be expected from textile materials. However, we also see that these same cloths have an embroidery that has specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than the surrounding textile material:

Specular reflection on cloth embroidery

The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).

Fragment positions from Light

GBuffer light-space position texture

Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that we can tell, together with the shadow map, which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, there is more detail on how that process works, step by step and including Vulkan source code, in my series of posts on that topic.

Step 4: Screen Space Ambient Occlusion

With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.

The idea here is that, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence, lighting in areas that are not directly lit by a light source looks rather flat, as we can see in this image:

SSAO disabled

Screen-space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, especially in areas that are not directly lit:

SSAO enabled

Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images: without SSAO we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.

To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):

SSAO output texture

In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.
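
To give an idea of what the gather looks like, here is a heavily simplified sketch. The real pass runs in a fragment shader, uses a randomized hemisphere kernel, adds range checks and a bias, and is followed by the blur mentioned above; project_to_screen() and scene_depth_at() are assumed helpers.

typedef struct { float x, y, z; } vec3;

extern vec3  project_to_screen(vec3 view_pos);    /* returns (screen x, screen y, depth) */
extern float scene_depth_at(float sx, float sy);  /* depth of the scene at that pixel */

float ssao_factor(vec3 frag_pos, const vec3 *kernel, int num_samples, float radius)
{
   float occlusion = 0.0f;
   for (int i = 0; i < num_samples; i++) {
      /* Offset the fragment position by a kernel sample around it. */
      vec3 s = { frag_pos.x + kernel[i].x * radius,
                 frag_pos.y + kernel[i].y * radius,
                 frag_pos.z + kernel[i].z * radius };
      vec3 proj = project_to_screen(s);
      /* If the scene geometry at that screen position is closer to the camera
       * than the sample point, the sample is occluded by nearby geometry. */
      if (scene_depth_at(proj.x, proj.y) < proj.z)
         occlusion += 1.0f;
   }
   /* 1.0 means fully open, 0.0 fully occluded: this is what ends up in the
    * SSAO texture and later modulates the ambient term in the lighting pass. */
   return 1.0f - occlusion / (float)num_samples;
}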

Step 5: Lighting pass

Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.

The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.

Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and using it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.

We also use the SSAO output to improve the ambient light term as described before, multiplying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.
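
Putting those pieces together, the per-fragment combination looks conceptually like the sketch below (names and structure are illustrative, not the demo’s code; in_shadow would come from the shadow map comparison and ssao from the SSAO texture):

typedef struct { float r, g, b; } color;

static color color_scale(color c, float s) { color o = { c.r * s, c.g * s, c.b * s }; return o; }
static color color_add(color a, color b)   { color o = { a.r + b.r, a.g + b.g, a.b + b.b }; return o; }

color shade_fragment(color diffuse, color specular, float n_dot_l,
                     float spec_factor, float ambient_strength,
                     float ssao, int in_shadow)
{
   /* Ambient term, modulated by the SSAO factor. */
   color result = color_scale(diffuse, ambient_strength * ssao);

   /* Diffuse and specular terms only contribute when the fragment is lit. */
   if (!in_shadow) {
      result = color_add(result, color_scale(diffuse, n_dot_l));
      result = color_add(result, color_scale(specular, spec_factor));
   }
   return result;   /* components may exceed 1.0 here; tone mapping handles that */
}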

The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.

After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:

Lighting pass output

Step 6: Tone mapping

After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, especially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).

To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.

The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.
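
As a concrete, if simplistic, example, here are two common per-component operators; I am not claiming the demo uses exactly these curves:

#include <math.h>

/* Reinhard: maps [0, inf) into [0, 1), compressing the brightest values. */
float tonemap_reinhard(float hdr)
{
   return hdr / (1.0f + hdr);
}

/* Exposure-based alternative: the exposure parameter shifts how much detail
 * is preserved in dark versus bright areas. */
float tonemap_exposure(float hdr, float exposure)
{
   return 1.0f - expf(-hdr * exposure);
}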

In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since it would exceed the 1.0 per-color-component cap and lead to pure white colors as a result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone mapping on the floor):

Tone mapping disabled
Tone mapping enabled

Step 7: Screen Space Reflections (SSR)

The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.

There are various ways to capture reflections, each with its own set of pros and cons. When I wrote my OpenGL terrain rendering demo I implemented water reflections using “Planar Reflections”, which produce very accurate results at the expense of requiring us to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to set up (for example, you would need to run an additional culling pass), and you also need to consider that this has to be done for each planar surface you want to apply reflections to, so it doesn’t scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render the scene twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).

So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.

Like all screen-space techniques, it uses information already present on the screen to capture the reflections, so we don’t have to render again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, especially when there is dynamic geometry involved. Nevertheless, in my particular case I don’t have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won’t go into the details of the artifacts introduced with SSR here, but for those interested, here is a good discussion.

I should mention that my take on this is fairly basic and doesn’t implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.

The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:

Capturing reflections

I only implemented support for SSR in the deferred path, since like in the case of SSAO (and more generally all screen-space algorithms), deferred rendering is the best match since we are already capturing screen-space information in the GBuffer.

The first stage requires a means to identify the fragments that need reflection information; in our case, the floor fragments. What I did for this is to capture the reflectiveness of the material of each fragment on the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.

One small note on the GBuffer storage: because this is a single floating-point value, we don’t necessarily need an extra attachment in the GBuffer (which would have some performance penalty); instead, we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn’t support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).

Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material’s roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections, but with rougher materials that do not have smooth surfaces we may want the reflections to look a bit blurry, to better represent these imperfections.

Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.

The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we execute a ray march from the fragment position in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample’s, it means that we have moved past foreground geometry and we stop the process. When that happens, we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.
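
A condensed sketch of that loop is shown below. It is illustrative only: the real pass runs in a fragment shader, marches in finer steps and refines the hit with the binary search mentioned above; project_to_screen() and scene_depth_at() are assumed helpers, as in the SSAO sketch.

#include <stdbool.h>

typedef struct { float x, y, z; } vec3;

extern vec3  project_to_screen(vec3 view_pos);    /* (screen x, screen y, depth) */
extern float scene_depth_at(float sx, float sy);  /* depth buffer sample */

bool ssr_march(vec3 origin, vec3 refl_dir, int max_steps, float step_len,
               float *hit_x, float *hit_y)
{
   vec3 p = origin;
   for (int i = 0; i < max_steps; i++) {
      /* Advance along the reflection direction. */
      p.x += refl_dir.x * step_len;
      p.y += refl_dir.y * step_len;
      p.z += refl_dir.z * step_len;

      vec3 s = project_to_screen(p);
      if (scene_depth_at(s.x, s.y) < s.z) {
         /* We stepped past foreground geometry: record the screen position
          * where the reflection color will be sampled from. */
         *hit_x = s.x;
         *hit_y = s.y;
         return true;
      }
   }
   return false;   /* no collision found: this pixel gets no reflection color */
}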

As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.

The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:

Reflection texture

Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though; this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.

Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won’t go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.

Considering material roughness

Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.
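
Conceptually the filter is nothing more than the sketch below (names are illustrative; reflection_at() is an assumed helper that samples the reflection texture and clamps at the image edges):

typedef struct { float r, g, b; } color;

extern color reflection_at(int x, int y);

color blur_reflection(int x, int y, float roughness, int max_radius)
{
   /* The rougher the material, the larger the kernel and the blurrier the result. */
   int radius = (int)(roughness * max_radius);
   color sum = { 0.0f, 0.0f, 0.0f };
   int count = 0;

   for (int dy = -radius; dy <= radius; dy++) {
      for (int dx = -radius; dx <= radius; dx++) {
         color c = reflection_at(x + dx, y + dy);
         sum.r += c.r; sum.g += c.g; sum.b += c.b;
         count++;
      }
   }
   sum.r /= count; sum.g /= count; sum.b /= count;
   return sum;
}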

Alpha blending

Finally, we use alpha blending to composite the reflections onto the original image (the output from the tone mapping pass), incorporating them into the final rendering:

SSR output

Step 8: Anti-aliasing (FXAA)

So far we have been neglecting anti-aliasing. Because we are doing deferred rendering, Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.

In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti-Aliasing (FXAA). The technique attempts to identify strong contrast across neighboring pixels in the image to identify edges and then smooth them out using linear filtering. Here is the final result, which matches the one I included as reference at the top of this post:

Anti-aliased output

The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.

Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.
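
To make the idea a bit more tangible, here is a toy version of the detection-and-blend step: compute a luma value per pixel, measure the contrast against the neighbours and, above a threshold, blend towards a local average. Real FXAA (and the demo’s pass) does considerably more work along the detected edge direction; pixel_at() is an assumed helper.

#include <math.h>

typedef struct { float r, g, b; } color;

extern color pixel_at(int x, int y);   /* assumed helper sampling the image */

static float luma(color c)
{
   return 0.299f * c.r + 0.587f * c.g + 0.114f * c.b;
}

color fxaa_pixel(int x, int y, float threshold)
{
   color c = pixel_at(x, y);
   color l = pixel_at(x - 1, y), r = pixel_at(x + 1, y);
   color u = pixel_at(x, y - 1), d = pixel_at(x, y + 1);

   float lc = luma(c);
   float lmin = fminf(fminf(luma(l), luma(r)), fminf(luma(u), luma(d)));
   float lmax = fmaxf(fmaxf(luma(l), luma(r)), fmaxf(luma(u), luma(d)));

   /* Low local contrast: not an edge, leave the pixel untouched. */
   if (fmaxf(lmax, lc) - fminf(lmin, lc) < threshold)
      return c;

   /* Edge detected: blend the pixel with the average of its neighbours. */
   color out = {
      0.5f * c.r + 0.125f * (l.r + r.r + u.r + d.r),
      0.5f * c.g + 0.125f * (l.g + r.g + u.g + d.g),
      0.5f * c.b + 0.125f * (l.b + r.b + u.b + d.b),
   };
   return out;
}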

Conclusions and source code

So that’s all, congratulations if you managed to read this far! In the past I have found articles that do this kind of frame analysis quite interesting, so it’s been fun writing one myself, and I only hope that this was interesting to someone else.

This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.

April 16, 2018

On the vc4 front, I did the investigation of the HDL to determine that the OLED matrix applies before the gamma tables, so we can expose it in the DRM for Android’s color correction. Stefan was also interested in reworking his fencing patches to use syncobjs, so hopefully we can merge those and get DRM HWC support in mainline soon.

I also took a look at a warning we’re seeing when a cursor with a nonzero hotspot goes to the upper left corner of the screen – unfortunately, fixing it properly looks like it’ll be a bit of a rework.

I finally took a moment to port over an etnaviv change to remove the need for a DRM subsystem node in the DT. This was a request from Rob Herring long ago, but etnaviv’s change finally made it clear what we should be doing instead.

For vc5, I stabilized the GPU scheduler work and pushed it to my main branch. I’ve now started working on using the GMP to isolate clients from each other (important for being able to have unprivileged GPU workloads running alongside X, and also for making sure that say, some misbehaving webgl doesn’t trash your X server’s other window contents). Hopefully once this security issue is resolved, I can (finally!) propose merging it to the kernel.

April 13, 2018

Dart iMX 8M

The i.MX6 platform has for the past few years enjoyed a large effort to add upstream support to Linux and surrounding projects. Now it is at the point where nothing is really missing any more. Improvements are still being made to the graphics driver for i.MX6, but functionally it is complete.

Etnaviv driver development timeline

The i.MX8 is a different story. The newly introduced platform, with hardware still difficult to get access to, is seeing lots of work, but much still remains to be done.

That being said, initial support for the GPU, the Vivante GC7000, is in place and is able to successfully run Wayland/Weston, glmark, etc. This should also mean that running Android on top of the currently not-quite-upstream stack is …

April 09, 2018

I continued spending time on VC5 in the last two weeks.

First, I’ve ported the driver over to the AMDGPU scheduler. Prior to this, vc4 and vc5’s render jobs get queued to the HW in the order that the GL clients submit them to the kernel. OpenGL requires that jobs within a client effectively happen in that order (though we do some clever rescheduling in userspace to reduce overhead of some render-to-texture workloads due to us being a tiler). However, having submission order to the kernel dictate submission order to the HW means that a single busy client (imagine a crypto miner) will starve your desktop workload, since the desktop has to wait behind all of the bulk-work jobs the other client has submitted.

With the AMDGPU scheduler, each client gets its own serial run queue, and the scheduler picks between them as jobs in the run queues become ready. It also gives us easy support for in-fences on your jobs, one of the requirements for Android. All of this is with a bit less vc5 driver code than I had for my own, inferior scheduler.

Currently I’m making it most of the way through piglit and conformance test runs, before something goes wrong around the time of a GPU reset and the kernel crashes. In the process, I’ve improved the documentation on the scheduler’s API, and hopefully this encourages other drivers to pick it up.

Second, I’ve been working on debugging some issues that may be TLB flushing bugs. On the piglit “longprim” test, we go through overflow memory quickly, and allocating overflow memory involves updating PTEs and then having the GPU read from those in very short order. I see a lot of GPU segfaults on non-writable PTEs where the new overflow BO was allocated just after the last one (so maybe the lookups that happened near the end of the last one pre-fetched some PTEs from our space?). The confusing part is that I keep getting write errors far past where I would have expected any previous PTE lookups to have gone. Yet, outside of this case and maybe a couple of others within piglit and the CTS, we seem to be completely fine at PTE updates.

On the VC4 front, I wrote some docs for what I think the steps are for people that want to connect new DSI panels to Raspberry Pi. I reviewed Stefan’s patches for using the CTM for color correction on Android (promising, except I’m concerned it applies at the wrong stage of the DRM display pipeline), and some of Boris’s work on async updates (simplifying our cursor and async pageflip path). I also reviewed an Intel patch that’s necessary for a core DRM change we want for our SAND display support, and a Mesa patch fixing a regression with the new modifiers code.

April 08, 2018

To reduce the number of bugs filed against libinput consider this a PSA: as of GNOME 3.28, the default click method on touchpads is the 'clickfinger' method (see the libinput documentation, it even has pictures). In short, rather than having a separate left/right button area on the bottom edge of the touchpad, right or middle clicks are now triggered by clicking with 2 or 3 fingers on the touchpad. This is the method macOS has been using for a decade or so.

Prior to 3.28, GNOME used the libinput defaults which vary depending on the hardware (e.g. mac touchpads default to clickfinger, most other touchpads usually button areas). So if you notice that the right button area disappeared after the 3.28 update, either start using clickfinger or reset using the gnome-tweak-tool. There are gsettings commands that achieve the same thing if gnome-tweak-tool is not an option:


$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
enum
'default'
'none'
'areas'
'fingers'
$ gsettings get org.gnome.desktop.peripherals.touchpad click-method
'fingers'
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'

For reference, the upstream commit is in gsettings-desktop-schemas.

Note that this only affects so-called ClickPads, touchpads where the entire touchpad is a button. Touchpads with separate physical buttons in front of the touchpad are not affected by any of this.

April 04, 2018

In the last update of the free software Panfrost driver, I unveiled the Midgard shader compiler. In the two weeks since then, I’ve shifted my attention from shaders back to the command stream, the fixed-function part of the pipeline. A shader compiler is only useful if there’s a way to run the shaders, after all!

The basic parts of the command stream have been known since the early days of the project, but in the past weeks, I methodically went through the OpenGL ES 2.0 specification searching for new features, writing test code to iterate the permutations, discovering how the feature is encoded in the command stream, and writing a decoder for it. This tedious process is at the heart of any free graphics driver project, but with patience, it is effective.

Thus, since the previous post, I have decoded the fields corresponding to: framebuffer clear flags, fragment discard hinting, viewports, blend shaders, blending colour masks, antialiasing (MSAA), face culling, depth factor/units, the stencil test, the depth test, depth ranges, dithering, texture channel swizzling, texture compare functions, texture wrap modes, alpha coverage, and attribute/varying types.

That was a doozy!

This marks an important milestone: excepting textures, framebuffer objects, and fancy blend modes, the command stream needed for OpenGL ES 2.0 is almost entirely understood. For context on why those features are presently missing, we have not yet been able to replay a sample with textures or framebuffer objects, presumably due to a bug in the replay infrastructure. Until we can do this, no major work can occur for them. Figuring this bit out is high priority, but work on this area is mixed in with work on other parts of the project, to avoid causing a stall (and a lame blog post in two weeks with nothing to report back). As for fancy blend modes, our hardware has a peculiar design involving programmable blending as well as a fixed-function subset of the usual pipeline. Accordingly, I’m deferring work on this obscure feature until the rest of the driver is mature.

On the bright side, we do understand more than enough to begin work on a real driver. Thus, I cordially present the one and only Half-Way Driver! Trademark pending. Name coined by yours truly about five minutes ago.

The premise for this driver is simple: to verify that our understanding of the hardware is sound, we need to write a driver that is higher level than the simple decoded replays. And of course, we want to write a real driver, within Mesa and using Gallium3D infrastructure; after all, the end-goal of the project is to enable graphics applications to use the hardware with free software. It’s pretty hard to drive the hardware without a driver – I should know.

On the other hand, it is preferable to develop this driver independently of Mesa and Gallium3D, to retain control of the flow of the codebase, to speed up development, and to simplify debugging. Mesa and Gallium3D are large codebases; while this is necessary for production use, the sheer number of lines of code contained becomes a cumbersome burden to early driver development. As an added incentive to avoid building within their infrastructure, Mesa recompiles are somewhat slow with hardware like mine: as stated, I use my, ahem, low-power RK3288 laptop for development. Besides, while I’m still discovering new aspects to the hardware in each development session, I could do without the looming, ever-present risk of upstream merge conflicts.

The solution – the creatively named Half Way Driver – is a driver that sits half-way between two opposite development strategies: a replay-driven, independent toy driver and a mature in-tree Mesa driver. In particular, the idea is to abstract a working replay into command stream constructors that follow Gallium3D conventions, including the permissively licensed Gallium3D headers themselves. This approach combines the benefits of each side: development is fast and easy, build times are short, and once the codebase is mature, it will be simple to move into Mesa itself and gain, almost for free, support for OpenGL, along with a number of other compatible state trackers. As an intermediate easing step, we may hook into this out-of-tree driver from softpipe, the reference software rasteriser in Gallium3D, progressively replacing software functionality with hardware-accelerated routines where possible.

In any event, this new driver is progressing nicely. At the moment, only clearing uses the native Gallium3D interface; the list of Galliumified functions will expand shortly. On the other hand, with a somewhat lower level interface, corresponding closely to the command stream, the driver supports the basic structures needed for rendering 3D geometry and running shaders. After some debugging, taking advantage of the differential tracing infrastructure originally built up to analyse the blob, the driver is able to support multiple draws over multiple frames, allowing for some cute GPU-accelerated animations!

Granted, by virtue of our capture-replay-decode workflow, the driver is not able to render anything that a previous replay could not, greatly limiting my screenshot opportunities. C’est la vie, je suppose. But hey, trust that seeing multiple triangles with different rendering states drawn in the same frame is quite exciting when you’ve been mashing your head against your keyboard for hours comparing command stream traces that are thousands of lines long.

In total, this work-in-progress brings us much closer to having a real Gallium3D driver, at which point the really fun demos start. (I’m looking at you, es2gears!)


On the shader side, progress continues to be steady. In the course of investigating blending on Midgard, including the truly bizarre “blend shaders” required for nontrivial blend modes, I uncovered a number of new opcodes relating to integers. In particular, the disassembler is now aware of the bitwise operations, which are used in this blend shader. For the compiler, I introduced a few new workarounds, presumably due to hardware errata, whose necessity was uncovered by improvements in the command stream.

For Bifrost shaders, Connor has continued his work decoding the instruction set. Notably, his recent changes enable complete disassembly of simple vertex shaders. In particular, he discovered a space-saving trick involving a nuanced mechanism for encoding certain registers, which disambiguated his previous disassembled shaders. Although he realised this fact earlier on, it’s also worth noting that there are great similarities to Midgard vertex shaders which were uncovered a few weeks ago – good news for when a Bifrost compiler is written! Among other smaller changes, he also introduced support for half-floats (fp16) and half-ints (int16), which implies a new set of instruction opcodes. He has also gathered initial traces of the Bifrost command stream, with an intent of gauging the difficulty in porting the current Midgard driver to Bifrost as well, allowing us to test shaders on the elegant new Gxx chips. In total, understanding of Bifrost progresses well; while Midgard is certainly leading the driver effort, the gap is closing.

In the near future, we’ll be Galliumising the driver. Stay tuned for scenes from our next episode!

March 27, 2018

The VCHI patches for Raspberry Pi are now merged to staging-next, which is a big step forward. It should probe by default on linux-next, though we’ve still got a problem with vchiq_test -f, as Stefan Wahren found. Dave has continued working on the v4l2 driver and hopefully we’ll get to merge over it soon.

After my burst of work on VC4, though, it was time to get back to VC5. I’ve been working on GLES conformance again, fixing regressions created by new tests (one of which would wedge the GPU such that it never recovered), and pushing up to about a 98% pass rate. I also got 7278 up and running, and it’s at about 97% now. There is at least one class of GPU hangs to resolve in it before it should match 7268. Some of the pieces from this VC5/6 effort included:

  • Adding register spilling support
  • Fixed 2101010 support in a few places
  • Fixed early z configuration within a frame
  • Fixed disabling of transform feedback on 7278
  • Fixed setup of large transform feedback outputs
  • Fixed transform feedback output with points (common in the CTS)
  • Fixed some asserts in core Mesa that we were the first to hit
  • Fixed gallium blits to integer textures (TGSI is the worst).

March 20, 2018

For some time now I have been working on and off on a personal project with no other purpose than toying a bit with Vulkan and some rendering and shading techniques. Although I’ll probably write about that at some point, in this post I want to focus on Vulkan’s specialization constants and how they can provide a very visible performance boost when they are used properly, as I had the chance to verify while working on this project.

The concept behind specialization constants is very simple: they allow applications to set the value of a shader constant at run-time. At first sight, this might not look like much, but it can have very important implications for certain shaders. To showcase this, let’s take the following snippet from a fragment shader as a case study:

layout(push_constant) uniform pcb {
   int num_samples;
} PCB;

const int MAX_SAMPLES = 64;
layout(set = 0, binding = 0) uniform SamplesUBO {
   vec3 samples[MAX_SAMPLES];
} S;

void main()
{
   ...
   for(int i = 0; i < PCB.num_samples; ++i) {
      vec3 sample_i = S.samples[i];
      ...
   }
   ...
}

That is a snippet taken from a Screen Space Ambient Occlusion shader that I implemented in my project, a popular technique used in a lot of games, so it represents a real-world scenario. As we can see, the process involves a set of vector samples passed to the shader as a UBO that are processed for each fragment in a loop. We have made the maximum number of samples that the shader can use large enough to accommodate a high-quality scenario, but the actual number of samples used in a particular execution will be taken from a push constant uniform, so the application has the option to choose the quality / performance balance it wants to use.

While the code snippet may look trivial enough, let’s see how it interacts with the shader compiler:

The first obvious issue we find with this implementation is that it prevents loop unrolling because the actual number of samples to use is unknown at shader compile time. At most, the compiler could guess that it can’t be more than 64, but that number of iterations would still be too large for Mesa to unroll the loop in any case. If the application is configured to only use 24 or 32 samples (the value of our push constant uniform at run-time) then that number of iterations would be small enough that Mesa would unroll the loop if that number was known at shader compile time, so in that scenario we would be losing the optimization just because we are using a push constant uniform instead of a constant for the sake of flexibility.

The second issue, which might be less immediately obvious and yet is the most significant one, is the fact that if the shader compiler can tell that the size of the samples array is small enough, then it can promote the UBO array to a push constant. This means that each access to S.samples[i] turns from an expensive memory fetch into a direct register access for each sample. To put this in perspective, if we are rendering to a full HD target using 24 samples per fragment, it means that we would be saving ourselves from doing 1920x1080x24 memory reads per frame for a very visible performance gain. But again, we would be losing this optimization because we decided to use a push constant uniform.

Vulkan’s specialization constants allow us to get back these performance optimizations without sacrificing the flexibility we implemented in the shader. To do this, the API provides mechanisms to specify the values of the constants at run-time, but before the shader is compiled.

Continuing with the shader snippet we showed above, here is how it can be rewritten to take advantage of specialization constants:

layout (constant_id = 0) const int NUM_SAMPLES = 64;
layout(std140, set = 0, binding = 0) uniform SamplesUBO {
   vec3 samples[NUM_SAMPLES];
} S;

void main()
{
   ...
   for(int i = 0; i < NUM_SAMPLES; ++i) {
      vec3 sample_i = S.samples[i];
      ...
   }
   ...
}

We are now informing the shader that we have a specialization constant NUM_SAMPLES, which represents the actual number of samples to use. By default (if the application doesn’t say otherwise), the specialization constant’s value is 64. However, now that we have a specialization constant in place, we can have the application set its value at run-time, like this:

VkSpecializationMapEntry entry = { 0, 0, sizeof(int32_t) };
VkSpecializationInfo spec_info = {
   1,
   &entry,
   sizeof(uint32_t),
   &config.ssao.num_samples
};
...

The application code above sets up the specialization constant information for shader consumption at run-time. This is done via an array of VkSpecializationMapEntry entries, each one determining where to fetch the constant value for each specialization constant declared in the shader whose default value we want to override. In our case, we have a single specialization constant (with id 0), and we take its value (of integer type) from offset 0 of a buffer; since there is only one constant, the buffer is just the address of the variable holding its value (config.ssao.num_samples). When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo field of VkPipelineShaderStageCreateInfo. At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.
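
For completeness, this is roughly how that hook-up looks when filling in the shader stage; shader_module is assumed to exist and spec_info is the structure from the previous snippet:

VkPipelineShaderStageCreateInfo stage_info = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
   .stage = VK_SHADER_STAGE_FRAGMENT_BIT,
   .module = shader_module,
   .pName = "main",
   /* The driver reads the constant values from here at pipeline creation,
    * before the shader is optimized and native GPU code is generated. */
   .pSpecializationInfo = &spec_info,
};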

It is important to remark that specialization takes place when we create the pipeline, since that is the only moment at which Vulkan drivers compile shaders. This makes specialization constants particularly useful when we know the value we want to use ahead of starting the rendering loop, for example when we are applying quality settings to shaders. However, if the value of the constant changes frequently, specialization constants are not useful, since they require expensive shader re-compiles every time we want to change their value, and we want to avoid that as much as possible in our rendering loop. Nevertheless, it is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.

Conclusions

Specialization constants are a straightforward yet powerful way to gain control over how shader compilers optimize your code. In my particular pet project, applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

Finally, although the above covered specialization constants from the point of view of Vulkan, this is really a feature of the SPIR-V language, so it is also available in OpenGL with the GL_ARB_gl_spirv extension, which is core since OpenGL 4.6.

March 18, 2018

In my last update on the Panfrost project, I showed an assembler and disassembler pair for Midgard, the shader architecture for Mali Txxx GPUs. Unfortunately, Midgard assembly is an arcane, unwieldy language, understood by Connor Abbott, myself, and that’s about it besides engineers bound by nondisclosure agreements. You can read the low-level details of the ISA if you’re interested.

In any case, what any driver really needs is not just an assembler but a compiler. Ideally, such a compiler would live in Mesa itself, capable of converting programs written in high level GLSL into an architecture-specific binary.

Such a mammoth task ought to be delayed until after we begin moving the driver into Mesa, through the Gallium3D infrastructure. In any event, back in January I had already begun such a compiler, ingesting NIR, an intermediate representation coincidentally designed by Connor himself. The past few weeks were spent improving and debugging this compiler until it produced correct, reasonably efficient code for both fragment and vertex shaders.

As of last night, I have reached this milestone for simple shaders!

As an example, an input fragment shader written in GLSL might look like:

uniform vec4 uni4;

void main() {
    gl_FragColor = clamp(
        vec4(1.3, 0.2, 0.8, 1.0) - vec4(uni4.z),
        0.0, 1.0);
}

Through the fully free compiler stack, passed through the free disassembler for legibility, this yields:

vadd.fadd.sat r0, r26, -r23.zzzz
br_cond.write +0
fconstants 1.3, 0.2, 0.8, 1

vmul.fmov r0, r24.xxxx, r0
br_cond.write -1

This is the optimal compilation for this particular shader; the majority of that shader is the standard fragment epilogue which writes the output colour to the framebuffer.

For some background on the assembly, Midgard is a Very Long Instruction Word (VLIW) architecture. That is, multiple instructions are grouped together in blocks. In the disassembly, this is represented by spacing. Each line is an instruction, and blank lines delimit blocks.

The first instruction contains the entirety of the shader logic. Reading it off, it means “using the vector addition unit, perform the saturated floating point addition of the attached constants (register 26) and the negation of the z component of the uniform (register 23), storing the result into register 0”. It’s very compact, but comparing with the original GLSL, it should be clear where this is coming from. The constants are loaded at the end of the block with the fconstants meta instruction.

The other four instructions are the standard fragment epilogue. We’re not entirely sure why it’s so strange – framebuffer writes are fixed from the result of register 0, and are accomplished with a special loop using a branching instruction. We’re also not sure why the redundant move is necessary; Connor and I suspect there may be a hardware limitation or errata preventing a br_cond.write instruction from standing alone in a block. Thankfully, we do understand more or less what’s going on, and they appear to be fixed. The compiler is able to generate it just fine, including optimising the code to write into register 0.

As for vertex shaders, well, fragment shaders are simpler than vertex shaders. Whereas the former merely has the aforementioned weird instruction sequence, vertex epilogues need to handle perspective division and viewport scaling, operations which are not implemented in hardware on this embedded GPU. When this is fully implemented, it will result in quite a bit more difficult-to-optimise code in the output, although even the vendor compiler does not seem to optimise it. (Perhaps in time our vertex shaders could be faster than the vendor’s compiled shaders due to a smarter epilogue!)

Without further ado, an example vertex shader looks like:

attribute vec4 vin;
uniform vec4 u;

void main() {
    gl_Position = (vin + u.xxxx * vec4(0.01, -0.02, 0.0, 0.0)) * (1.0 / u.x);
}

Through the same stack and a stub vertex epilogue which assumes there is no perspective division needed (that the input is normalised device coordinates) and that the framebuffer happens to be the resolution 400x240, the compiler emits:

vmul.fmov r1, r24.xxxx, r26
fconstants 0, 0, 0, 0

ld_attr_32 r2, 0, 0x1E1E

vmul.fmul r4, r23.xxxx, r26
vadd.fadd r5, r2, r4
fconstants 0.01, -0.02, 0, 0

lut.frcp r6.x, r23.xxxx, #2.61731e-39
fconstants 0.01, -0.02, 0, 0

vmul.fmul r7, r5, r6.xxxx

vmul.fmul r9, r7, r26
fconstants 200, 120, 0.5, 0

vadd.fadd r27, r26, r9
fconstants 200, 120, 0.5, 1

st_vary_32 r1, 0, 0x1E9E

There is a lot of room for improvement here, but for now, the important part is that it does work! The transformed vertex (after scaling) must be written to the special register 27. Currently, a dummy varying store is emitted to work around what appears to be yet another hardware quirk. (Are you noticing a trend here? GPUs are funky.) The rest of the code should be more or less intelligible by looking at the ISA notes. In the future, we might improve the disassembler to hide some of the internal encoding peculiarities, such as the dummy r24.xxxx and #0 arguments for fmov and frcp instructions respectively.

All in all, the compiler is progressing nicely. It is currently using a simple SSA-based intermediate representation which maps one-to-one with the hardware, minus details about register allocation and VLIW. This architecture will enable us to optimise our code as needed in the future, once we write a register allocator and an instruction scheduler. A number of arithmetic (ALU) operations are supported, and although there is much work left to do – including generating texture instructions, which were only decoded a few weeks ago – the design is sound, clocking in at a mere 1500 lines of code.

The best part, of course, is that this is no standalone compiler; it is already sitting in our fork of mesa, using mesa’s infrastructure. When the driver is written, it’ll be ready from day 1. Woohoo!

Source code is available; get it while it’s hot!


Getting the shader compiler to this point was a bigger time sink than anticipated. Nevertheless, we did do a bit of code cleanup in the meantime. On the command stream side, I began passing memory-resident structures by name rather than by address, slowly rolling out a basic watermark allocator. This step is revealing potential issues in our understanding of the command stream, preparing us for proper, non-replay-based driver development. Textures still remain elusive, unfortunately. Aside from that, however, much – if not most – of the command stream is well understood now. With the help of the shader compiler, basic 3D tests like test-triangle-smoothed are now almost entirely understood and for the most part devoid of magic.

Lyude Paul has been working on code clean-up specifically regarding the build systems. Her goal is to let new contributors play with GPUs, rather than fight with meson and CMake. We’re hoping to attract some more people with low-level programming knowledge and some spare time to pitch in. (Psst! That might mean you! Join us on IRC!)

On a note of administrivia, the project name has been properly changed to Panfrost. For some history: over the summer, two driver projects were formed – chai, by me, for Midgard; and BiOpenly, by Lyude et al, for Bifrost. Thanks to Rob Clark’s matchmaking, we found each other and quickly realised that the two GPU architectures had identical command streams; it was only the shader cores that had been totally redesigned (which is what prompted the architecture rename). Thus, we merged to join efforts, but the new name was never officially decided.

We finally settled on the name “Panfrost”, and our infrastructure is being changed to reflect this. The IRC channel, still on Freenode, now redirects to #panfrost. Additionally Freedesktop.org rolled out their new GitLab CE instance, of which we are the first users; you can find our repositories at the Panfrost organisation on the fd.o GitLab.


On Monday, our project was discussed in Robert Foss’s talk “Progress in the Embedded GPU Ecosystem”. Foss predicted the drivers would not be ready for another three years.

Somehow, I have a feeling it’ll be much sooner!

March 12, 2018

It was only a few weeks ago when I posted that the Intel Mesa driver had successfully passed the Khronos OpenGL 4.6 conformance on day one, and now I am very proud that we can announce the same for the Intel Mesa Vulkan 1.1 driver, the new Vulkan API version announced by the Khronos Group last week. Big thanks to Intel for making Linux a first-class citizen for graphics APIs, and especially to Jason Ekstrand, who did most of the Vulkan 1.1 enablement in the driver.

At Igalia we are very proud of being a part of this: on the driver side, we have contributed the implementation of VK_KHR_16bit_storage, numerous bugfixes for issues raised by the Khronos Conformance Test Suite (CTS) and code reviews for some of the new Vulkan 1.1 features developed by Intel. On the CTS side, we have worked with other Khronos members in reviewing and testing additions to the test suite, identifying and providing fixes for issues in the tests as well as developing new tests.

Finally, I’d like to highlight the strong industry adoption of Vulkan: as stated in the Khronos press release, various other hardware vendors have already implemented conformant Vulkan 1.1 drivers; we are also seeing major 3D engines adopting and supporting Vulkan, and AAA games that have already shipped with Vulkan-powered graphics. There is no doubt that this is only the beginning and that we will be seeing a lot more of Vulkan in the coming years, so look forward to it!

Vulkan and the Vulkan logo are registered trademarks of the Khronos Group Inc.

This week I wrote a little patch series to get VCHI probing on upstream Raspberry Pi. As we’re building a more normal media stack for the platform, I want to get this upstreamed, and VCHI is at the root of the firmware services for media.

Next step for VCHI upstreaming will be to extract Dave Stevenson’s new VCSM driver and upstream it, which as I understand it lets you do media decode stuff without gpu_mem= settings in the firmware – the firmware will now request memory from Linux, instead of needing a fixed carveout. That driver will also be part of the dma-buf plan for the new v4l2 mem2mem driver he’s been working on.

Dave Stevenson has managed to produce a V4L2 mem2mem driver doing video decode/encode. He says it’s still got some bugs, but things look really promising.

In VC4 display, Stefan Schake submitted patches for fixing display plane alpha blending in the DRM hwcomposer for Android, and I’ve merged them to drm-misc-next.

I also rebased my out-of-tree DPI patch, fixed the regression from last year, and submitted patches upstream and downstream (including a downstream overlay). Hopefully this can help other people attach panels to Raspberry Pi.

On the 3D side, I’ve pushed the YUV-import accelerated blit code. We should now be able to display dma-bufs fast in Kodi, whether you’ve got KMS planes or the fallback GL composition.

Also, now that the kernel side has made it to drm-next, I’ve pushed Boris’s patches for vc4 perfmon into Mesa. Now you can use commands like:

apitrace replay application.trace
    --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

to examine the behavior of your GL applications on the HW. Note that doing --pdraw level tracing (instead of --pframes) means that each draw call will flush the scene, which is incredibly expensive in terms of memory bandwidth.

March 11, 2018


A recording of the talk is available here.

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

Thanks

This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of Embedded Linux Conference NA, for hosting a great event.

March 09, 2018
This is the first entry in an on-going series. Here's a list of all entries:
  1. What has TableGen ever done for us?
  2. Functional Programming
  3. Bits
  4. Resolving variables
  5. DAGs
  6. to be continued
Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.

TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.

For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.

The unit of description is the record. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So, for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the Instruction class and generates instruction information tables that are used by the C++ code in the backend and the shared codegen infrastructure.

The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.

To get an idea of what TableGen looks like, consider this simple example:
def Plain {
  int x = 5;
}

class Room<string name> {
  string Name = name;
  string WallColor = "white";
}

def lobby : Room<"Lobby">;

multiclass Floor<int num, string color> {
  let WallColor = color in {
    def _left : Room<num # "_left">;
    def _right : Room<num # "_right">;
  }
}

defm first_floor : Floor<1, "yellow">;
defm second_floor : Floor<2, "gray">;
This example defines 6 records in total. If you have an LLVM build around, just run the above through llvm-tblgen to see them for yourself. The first one has name Plain and contains a single value named x of value 5. The other 5 records have Room as a superclass and contain different values for Name and WallColor.

The first of those is the record of name lobby, whose Name value is "Lobby" (note the difference in capitalization) and whose WallColor is "white".

Then there are four records with the names first_floor_left, first_floor_right, second_floor_left, and second_floor_right. Each of those has Room as a superclass, but not Floor. Floor is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, Floor has two record prototypes, _left and _right. They are instantiated by each of the defm directives. Note how even though def and defm look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), the other creates a record that may or may not have one or more superclasses.

The Name value of first_floor_left is "1_left" and its WallColor is "yellow", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:
class Foo {
  string salutation = "Hi";
  string message = salutation#", world!";
}

def : Foo {
  let salutation = "Hello";
}
The message of the anonymous record defined by the def-statement is "Hello, world!".

There is much more to TableGen. For example, a particularly surprising but extremely useful feature are the bit sets that are used to describe instruction encodings. But that's for another time.

For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:
class Tag<int num> {
  int Number = num;
}

class Test<int num> {
  int Number1 = Tag<5>.Number;
  int Number2 = Tag<num>.Number;
  Tag Tag1 = Tag<5>;
  Tag Tag2 = Tag<num>;
}

def : Test<5>;
What are the values in the anonymous record? It turns out that Number1 and Number2 are both 5, but Tag1 and Tag2 refer to different records. Tag1 refers to an anonymous record with superclass Tag and Number equal to 5, while Tag2 also refers to an anonymous record, but with Number equal to an unresolved variable reference.

This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.

So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.
This is the fifth part of a series; see the first part for a table of contents.

With bit sequences, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. DAG nodes are another; they look a bit like S-expressions:
def op1;
def op2;
def i32;

def Example {
  dag x = (op1 $foo, (op2 i32:$bar, "Hi"));
}
In the example, there are two DAG nodes, represented by a DagInit object in the code. The first node has as its operation the record op1. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or arguments: the first argument is named foo but has no value. The second argument has no name, but it does have another DAG node as its value.

This second DAG node has the operation op2 and two arguments. The first argument is named bar and has value i32, the second has no name and value "Hi".

DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.

There are three main uses of DAGs:
  1. Describing the operands on machine instructions.
  2. Describing patterns for instruction selection.
  3. Describing register files with something called "set theory".
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.

Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.

I will illustrate some of this with the example of the V_ADD_F32 instruction from the AMDGPU backend. V_ADD_F32 is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as V_ADD_F32_e32.

Let's take a look at some of the fully resolved records produced by the TableGen frontend:
def V_ADD_F32_e32 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);
  string AsmOperands = "$vdst, $src0, $src1";
  ...
}


def anonymous_503 {    // DAGOperand RegisterOperand VOPDstOperand
  RegisterClass RegClass = VGPR_32;
  string PrintMethod = "printVOPDst";
  ...
}
As you'd expect, there is one out operand. It is named vdst and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string "printVOPDst" will be used by the backend that generates the bulk of the instruction printer code, and refers to the method AMDGPUInstPrinter::printVOPDst that is implemented manually.)

There are two in operands. src1 is a 32-bit general purpose vector register and requires no special handling, but src0 supports more complex operands as described in the record VSrc_f32 elsewhere.

Also note the string AsmOperands, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.

This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at V_ADD_F32_e64, the 64-bit encoded version, which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old OpenGL assembly program extensions or with DirectX shader assembly.

The fully resolved records produced by the TableGen frontend are quite a bit more involved:
def V_ADD_F32_e64 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList =
    (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,
         FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,
         clampmod:$clamp, omod:$omod);
  string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";
  list<dag> Pattern =
    [(set f32:$vdst, (fadd
      (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];
  ...
}

def FP32InputMods {     // DAGOperand Operand InputMods FPInputMods
  ValueType Type = i32;
  string PrintMethod = "printOperandAndFPInputMods";
  AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;
  ...
}


def FP32InputModsMatchClass {   // AsmOperandClass FPInputModsMatchClass
  string Name = "RegOrImmWithFP32InputMods";
  string PredicateMethod = "isRegOrImmWithFP32InputMods";
  string ParserMethod = "parseRegOrImmWithFPInputMods";
  ...
}
The out operand hasn't changed, but there are now many more special in operands that describe whether those additional features of the instruction are used.

You can again see how records such as FP32InputMods refer to manually implemented methods. Also note that the AsmOperands string no longer refers to src0 or src1. Instead, the printOperandAndFPInputMods method on src0_modifiers and src1_modifiers will print the source operand together with its sign modifiers. Similarly, the special ParserMethod parseRegOrImmWithFPInputMods will be used by the assembly parser.

This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.

Something else is new here: the Pattern. This pattern, together with all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the SelectionDAG into machine instructions. Let's take this particular pattern apart:
(set f32:$vdst, (fadd ...))
We will match an fadd selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand vdst. (set, fadd and many others are defined in the target-independent include/llvm/Target/TargetSelectionDAG.td.)
(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by complex patterns. Here's one of them:
def VOP3Mods { // ComplexPattern
  string SelectFunc = "SelectVOP3Mods";
  int NumOperands = 2;
  ...
}
As you'd expect, there's a manually implemented SelectVOP3Mods method. Its signature is
bool SelectVOP3Mods(SDValue In, SDValue &Src,
                    SDValue &SrcMods) const;
It can reject the match by returning false; otherwise, it pattern-matches a single input SelectionDAG node into nodes that will be placed into src1 and src1_modifiers in the particular pattern we were studying.
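
To give a feel for what such a method does, here is a simplified, free-standing C++ sketch of the kind of logic a source-modifier ComplexPattern selector implements: strip fneg/fabs nodes off the input and record them as bits in a modifier immediate instead. The modifier bit values and the free-function form are made up for illustration; the real AMDGPUDAGToDAGISel member differs in detail:
// Illustrative sketch only; not the actual AMDGPU implementation.
#include "llvm/CodeGen/SelectionDAG.h"

static constexpr unsigned SRC_MOD_NEG = 1 << 0; // hypothetical encoding
static constexpr unsigned SRC_MOD_ABS = 1 << 1; // hypothetical encoding

static bool SelectVOP3ModsSketch(llvm::SelectionDAG &DAG, llvm::SDValue In,
                                 llvm::SDValue &Src, llvm::SDValue &SrcMods) {
  using namespace llvm;
  unsigned Mods = 0;
  Src = In;
  if (Src.getOpcode() == ISD::FNEG) {   // -x becomes a NEG modifier bit
    Mods |= SRC_MOD_NEG;
    Src = Src.getOperand(0);
  }
  if (Src.getOpcode() == ISD::FABS) {   // |x| becomes an ABS modifier bit
    Mods |= SRC_MOD_ABS;
    Src = Src.getOperand(0);
  }
  SrcMods = DAG.getTargetConstant(Mods, SDLoc(In), MVT::i32);
  return true; // a plain source always matches, so we never reject here
}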

Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the S_BFM_B32 instruction, which generates a bitfield mask:
def anonymous_2373anonymous_2371 {    // Pattern Pat ...
  dag PatternToMatch =
    (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));
  list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];
  ...
}
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have Pattern as a superclass. In this case, we match an expression of the form ((1 << a) - 1) << b on 32-bit integers into a single machine instruction.
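
As a quick sanity check of that identity, ((1 << a) - 1) << b really is a mask of a consecutive bits starting at bit b, which is exactly what a bitfield-mask instruction produces:
#include <cstdio>

int main() {
    unsigned a = 4, b = 8;
    unsigned mask = ((1u << a) - 1u) << b;
    printf("0x%08X\n", mask); // prints 0x00000F00: four bits set, starting at bit 8
    return 0;
}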

So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.
March 07, 2018
Vulkan 1.1 was officially released today, and thanks to a big effort by Bas and a lot of shared work from the Intel anv developers, radv is a launch day conformant implementation.

https://www.khronos.org/conformance/adopters/conformant-products#submission_308

is a link to the conformance results. This is also radv's first time to be officially conformant on Vega GPUs. 

https://patchwork.freedesktop.org/series/39535/
is the patch series; it requires a bunch of common anv patches to land first. This stuff should all be landing in Mesa shortly, or most likely will already have landed by the time you read this.

In order to advertise 1.1 you need at least a 4.15 Linux kernel.

Thanks to all involved in making this happen, including the behind-the-scenes effort to allow radv to participate in the launch day!

March 06, 2018
This is the fourth part of a series; see the first part for a table of contents.

It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in lib/TableGen/ inside the LLVM repository, e.g. here on the GitHub mirror. The backends for LLVM itself live in utils/TableGen/, together with the command line tool's main() function. Clang also has its own backends.

Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:
class Foo<int src> {
  int Src = src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
}

multiclass Foos<int src> {
  def a : Foo<src>;
  let Offset = 2 in
  def b : Foo<src>;
}

foreach i = 0-3 in
defm F#i : Foos<i>;
This is actually broken in older LLVM by one of the many bugs, but clearly it should work based on what kind of features are generally available, and with my patch series it certainly does work in the natural way. We see four kinds of variable references:
  • internally within a record, such as the initializer of Dst referencing Src and Offset
  • to a class template variable, such as Src being initialized by src
  • to a multiclass template variable, such as src being passed as a template argument for Foo
  • to a foreach iteration variable
As an aside, keep in mind that let in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages let introduces a new variable, but TableGen's let instead overrides the value of a variable that has already been defined elsewhere. In the example above, the let-statement causes the value of Offset to be changed in the record that was instantiated from the Foo class to create the b prototype inside multiclass Foos.

TableGen internally represents variable references as instances of the VarInit class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:
class Foo<int Foo:src = ?> {
  int Src = Foo:src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
  string NAME = ?;
}
As you can see, Foo:src is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).

Similarly, references to foreach iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, TGParser::ParseIDValue is a good place to start.

The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with let-statements, internal references among record variables must be resolved after template arguments.

Actually resolving variable references used to be done by the implementations of the following virtual method of the Init class hierarchy (which represents initializers, i.e. values and expressions):
virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;
This method recursively resolves references in the constituent parts of the expression and then performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: R represents the "current" record which is used as a frame of reference for magical lookups in the implementation of !cast; this is a topic for another time, though. At the same time, variables referencing R are supposed to be resolved, but only if RV is null. If RV is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around unset depend on this.

This is replaced in my changes with
virtual Init *resolveReferences(Resolver &R) const;
where Resolver is an abstract base class / interface which can lookup values based on their variable names:
class Resolver {
  Record *CurRec;

public:
  explicit Resolver(Record *CurRec) : CurRec(CurRec) {}
  virtual ~Resolver() {}

  Record *getCurrentRecord() const { return CurRec; }
  virtual Init *resolve(Init *VarName) = 0;
  virtual bool keepUnsetBits() const { return false; }
};
The "current record" is used as a reference for the aforementioned magical !casts, and keepUnsetBits instructs the implementation of bit sequences in BitsInit not to resolve to ? (as was explained in the third part of the series). resolve itself is implemented by one of the subclasses, most notably:
  1. MapResolver: Resolve based on a dictionary of name-value pairs.
  2. RecordResolver: Resolve variable names that appear in the current record.
  3. ShadowResolver: Delegate requests to an underlying resolver, but filter out some names.
This last type of resolver is used by the implementations of !foreach and !foldl to avoid mistakes with nesting. Consider, for example:
class Exclamation<list<string> messages> {
  list<string> Messages = !foreach(s, messages, s # "!");
}

class Greetings<list<string> names>
    : Exclamation<!foreach(s, names, "Hello, " # s)>;

def : Greetings<["Alice", "Bob"]>;
This effectively becomes a nested !foreach. The iteration variable is named s in both, so when substituting s for the outer !foreach, we must ensure that we don't also accidentally substitute s in the inner !foreach. We achieve this by having !foreach wrap the given resolver with a ShadowResolver. The same principle applies to !foldl as well, of course.
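
To make the resolver flavours above a little more concrete, here is a minimal C++ sketch of a map-based resolver built on the Resolver interface quoted earlier. It is an illustration only (the real MapResolver in LLVM differs in detail), but it shows the basic shape: resolving template arguments essentially amounts to filling such a map with name/value pairs and then calling resolveReferences on the record's initializers:
// Illustration only; assumes the Resolver, Init and Record declarations
// shown above are in scope. The real MapResolver differs in detail.
#include <map>

class SimpleMapResolver : public Resolver {
  std::map<Init *, Init *> Values; // keyed by the variable-name Init

public:
  explicit SimpleMapResolver(Record *CurRec) : Resolver(CurRec) {}

  // Register a substitution, e.g. mapping the template argument Foo:src to 5.
  void set(Init *VarName, Init *Value) { Values[VarName] = Value; }

  Init *resolve(Init *VarName) override {
    auto It = Values.find(VarName);
    // A nullptr result leaves the reference untouched, so unrelated
    // variables simply stay unresolved.
    return It != Values.end() ? It->second : nullptr;
  }
};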
March 05, 2018

About two weeks ago, I published a screenshot of a smoothed triangle rendered with a free software driver on a Mali T760 with binary shaders.

But… binary shaders? C’mon, I shouldn’t stoop that low! What good is it to have a free software driver if we’re dependent on a proprietary mystery blob to compile our shaders, arguably the most important capability of a modern graphics driver?

There was little excuse – even then the shader instruction set was partially understood through the work of Connor Abbott back in 2013. At the time, Connor decoded the majority of arithmetic (ALU) and load-store instructions; additionally, he wrote a disassembler based on his findings. It is hard to overstate the magnitude of Connor’s contributions here; decoding a modern instruction set like Midgard is a major feat, of comparable difficulty to decoding the GPU’s command stream itself. In any case, though, his work resulted in detailed documentation and a disassembler strictly for prototyping work, never meant for real world use.

Naturally enough, I did the unthinkable, by linking directly to the disassembler’s internal library from the command stream tracer. After cleaning up the disassembler code a bit, massaging its output into normal assembly rather than a collection of notes-to-self, the relevant source code for our smoothed triangle changed from:

FILE *f_shader_12 = fopen("shader_12.bin", "rb");
fread(shader_12, 1, 4096, f_shader_12);
fclose(f_shader_12);

(where shader_12.bin is a nontrivial blob extracted from the command stream containing the compiled shaders as well as some other unused code), to a much more readable:

const char shader_src_2[] = R"(
    ld_vary_16 r0.xy, 0.xyxx, 0xA01E9E

    vmul.fmov r0, r24.xxxx, hr0
    fb.write 0x1808
    
    vmul.fmov r0, r24.xxxx, r0
    fb.write 0x1FF8
)";

pandev_shader_assemble(shader_12 + 288, shader_src_2);

There are still some mystery hex constants there, but the big parts are understood for fragment shaders at least. Vertex shaders are a little more complicated, but having this disassembly will make those much easier to understand as well.

In any event, having this disassembly embedded into the command stream isn’t any good without an assembler…

…so, of course, I then wrote a Midgard assembler. It’s about five hundred lines of Python, plus Pythonised versions of architecture definitions from the disassembler. This assembler isn’t too pretty or performant, but as long as it works, it’s okay; the real driver will use an emitter written directly in C and bypassing the assembly phase.

Indeed, this assembler, although still incomplete in some areas, works very well for the simple shaders we’re currently experimenting with. In fact, a compiled binary can be disassembled and then reassembled with our tools, yielding bit identical output.

That is, we can be even more reckless and call out to this prototype assembler from within the command stream. Look Ma, no blobs!

There is no magic. Although Midgard assembly is a bit cumbersome, I have been able to write some simple fragment shaders in assembly by hand, using only the free toolchain. Woohoo!


Sadly, while Connor’s 2013-era notes were impressive, they were lacking in a few notable areas; in particular, he had not made any progress decoding texture words. Similarly, the elusive fbwrite field was never filled in. Not an issue – Connor and I decoded much of the texture pipeline, fbwrite, and branching. Many texture instructions can now be disassembled without unknown words! And of course, for these simpler texture instructions, we can reassemble them again bit-identical.


But we’ve been quite busy. Although the above represents quite a bit of code, that didn’t take the entirety of two weeks, of course. The command stream saw plenty of work, too, but that isn’t quite as glamorous as shaders. I decoded indexed draws, which now appear to work flawlessly. More interestingly, I began work investigating texture and sampler descriptors. A handful of fields are known there, as well as the general structure, although I have not yet successfully replayed any textures, nor have I looked into texture swizzling. Additionally, I identified a number of minor fields relating to: glFrontFace, glLineWidth, attribute and uniform count, framebuffer dimensions, depth/stencil enables, face culling, and vertex counts. Together, I estimate I’ve written about 1k lines of code since the last update, which is pretty crazy.

So, what’s next in the pipeline?

Textures, of course! I’d also like to clean up the command stream replays, particularly relating to memory allocation, to ensure there are no large gaps in our understanding of the hardware.

After that, well, it’ll be time to dust off the NIR compiler I began at the end of January… and start moving code into Mesa!

The future is looking bright for the Panfrost driver.

This week I got the new hardware-accelerated blits for YUV import to GL working again.

The pipeline I have is drm_mmal decoding 360 frames of 1080p Big Buck Bunny trailer using MMAL, importing them to GL as an image_external texture, drawing to the back buffer, and pageflipping.

Switching MMAL from producing linear RGBA buffers to linear NV12 buffers improved FPS by 18.4907% +/- 0.746806% (n=7), and to YV12 by 14.4922% +/- 0.569289%. The improvement is slightly understated, as there’s some fixed overhead of waiting for vcsm to time out to indicate that the stream is complete.

I also polished up Dave’s SAND support patch for KMS, and submitted it. This lets video decode into KMS planes skip a copy of the buffers (I didn’t do performance comparisons of this, though).

Patches are submitted, and the next step will be to implement import of SAND buffers in GL to match the KMS support.

February 28, 2018

After the Raspberry Pi visit, I had a week off to wander around the UK with my partner, and now I’m back.

First, I got to fix regressions in Mesa master on both vc4 and vc5. (Oh, how I wish for non-vendor-specific CI of this project). I also wrote 17 patches to fix various compiler warnings that were driving me nuts.

I refactored my VC4 YUV GL import support, and pulled out the basic copying of the incoming linear data into the tiled format the 3D engine uses. This is a CPU-side copy, so it’s really slow due to the uncached read, but it means that you can now import YUV textures using EGL_image_external on Mesa master. Hopefully this can enable Kodi devs to start playing with this on their KMS build.

I’ve also rewritten the hardware-accelerated YUV blit code to hopefully be mergeable. Now I just need to stabilize it.

In VC5 land, I’ve tested and pushed a couple of new fixes to enable 3D textures.

On the kernel side, I’ve merged a bunch of DT and defconfig patches for Pi platform enabling, and sent them upstream. In particular I want to call out Baruch Siach’s firmware GPIO expander patch series, which took several revisions to get accepted (sigh, DT), but will let us do proper Pi3 HDMI hotplug detection and BT power management. Boris’s merged patch to forward-port my I2C fix also apparently fixes some EDID detection on HDMI monitors, which will be good news for people trying to switch to KMS.

February 26, 2018
Edit 2018-02-26: renamed from libevdev-python to python-libevdev. That seems to be a more generic name and easier to package.

Last year, just before the holidays Benjamin Tissoires and I worked on a 'new' project - python-libevdev. This is, unsurprisingly, a Python wrapper to libevdev. It's not exactly new since we took the git tree from 2016 when I was working on it the first time round but this time we whipped it into a better shape. Now it's at the point where I think it has the API it should have, pythonic and very easy to use but still with libevdev as the actual workhorse in the background. It's available via pip3 and should be packaged for your favourite distributions soonish.

Who is this for? Basically anyone who needs to work with the evdev protocol. While C is still a thing, there are many use-cases where Python is a much more sensible choice. The python-libevdev documentation on ReadTheDocs provides a few examples which I'll copy here, just so you get a quick overview. The first example shows how to open a device and then continuously loop through all events, searching for button events:


import sys
import libevdev

fd = open('/dev/input/event0', 'rb')
d = libevdev.Device(fd)
if not d.has(libevdev.EV_KEY.BTN_LEFT):
    print('This does not look like a mouse device')
    sys.exit(0)

# Loop indefinitely while pulling the currently available events off
# the file descriptor
while True:
    for e in d.events():
        if not e.matches(libevdev.EV_KEY):
            continue

        if e.matches(libevdev.EV_KEY.BTN_LEFT):
            print('Left button event')
        elif e.matches(libevdev.EV_KEY.BTN_RIGHT):
            print('Right button event')
The second example shows how to create a virtual uinput device and send events through that device:

import libevdev

d = libevdev.Device()
d.name = 'some test device'
d.enable(libevdev.EV_REL.REL_X)
d.enable(libevdev.EV_REL.REL_Y)
d.enable(libevdev.EV_KEY.BTN_LEFT)
d.enable(libevdev.EV_KEY.BTN_MIDDLE)
d.enable(libevdev.EV_KEY.BTN_RIGHT)

uinput = d.create_uinput_device()
print('new uinput test device at {}'.format(uinput.devnode))
events = [libevdev.InputEvent(libevdev.EV_REL.REL_X, 1),
          libevdev.InputEvent(libevdev.EV_REL.REL_Y, 1),
          libevdev.InputEvent(libevdev.EV_SYN.SYN_REPORT, 0)]
uinput.send_events(events)
And finally, if you have a textual or binary representation of events, the evbit function helps to convert it to something useful:

>>> import libevdev
>>> print(libevdev.evbit(0))
EV_SYN:0
>>> print(libevdev.evbit(2))
EV_REL:2
>>> print(libevdev.evbit(3, 4))
ABS_RY:4
>>> print(libevdev.evbit('EV_ABS'))
EV_ABS:3
>>> print(libevdev.evbit('EV_ABS', 'ABS_X'))
ABS_X:0
>>> print(libevdev.evbit('ABS_X'))
ABS_X:0
The latter is particularly helpful if you have a script that needs to analyse event sequences and look for protocol bugs (or hw/fw issues).

More explanations and details are available in the python-libevdev documentation. That doc also answers the question why python-libevdev exists when there's already a python-evdev package. The code is up on github.