planet.freedesktop.org
June 23, 2017
As mentioned in my previous blog, the X.Org Foundation now wants us to blog every week. Whilst that means shorter blogs (last week’s was a tad long), it also means that there isn’t much to blog about if I didn’t do much in a week. Such is the case for this week, sadly. It’s the last week of university, so there were a few assignments with deadlines I needed to meet, and I haven’t been able to invest as much time in my project as I would have wanted.

In this week’s article for my ongoing Google Summer of Code (GSoC) project I had planned to write about the basic idea behind the project, but I reconsidered and decided to first give an overview of how Xwayland functions at a high level, and next week take a look at its inner workings in detail. The reason is that there is not much Xwayland documentation available right now, so these two articles are meant to fill this void and give interested beginners a helping hand. In two weeks I’ll then catch up on explaining the project’s idea.

As we go high level this week, the first question is: what is Xwayland supposed to achieve at all? You may know this already. It’s something in a Wayland session that ensures that applications which don’t support Wayland, but only the old Xserver, still function normally; i.e. it ensures backwards compatibility. But how does it do this? Before we go into that, there is one more thing to talk about, since I only called Xwayland “something” before. What is Xwayland exactly? How does it look on your Linux system? We’ll see next week that this isn’t as easy to answer as the following simple explanation makes it appear, but for now it is enough: it’s a single binary containing an Xserver with a special backend written to communicate with the Wayland compositor active on your system - for example with KWin in a Plasma Wayland session.

To make it more tangible let’s take a look at Debian: There is a package called Xwayland and it consists of basically only the aforementioned binary file. This binary gets copied to /usr/bin/Xwayland. Compare this to the normal Xserver provided by X.org, which in Debian you can find in the package xserver-xorg-core. The respective binary gets put into /usr/bin/Xorg together with a symlink /usr/bin/X pointing to it.
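If you want to verify this yourself on a Debian system, something like the following should show the binary described above (note that the package name is lowercase on the command line; the exact output may vary between releases):

$ dpkg -L xwayland | grep bin/
/usr/bin/Xwayland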

While the latter is the central building block in an X session and therefore gets launched before anything else with graphical output, the Xserver in the Xwayland binary works differently: it is embedded in a Wayland session. And in a Wayland session the Wayland compositor is the central building block. This means in particular that the Wayland compositor also takes up the role of being the server, which talks to Wayland native applications with graphical output as its clients. They send requests to it in order to present their painted content on the screen. The Xserver in the Xwayland binary is only a necessary link between applications that can only speak to an Xserver and the Wayland compositor/server. Therefore the Xwayland binary gets launched later on by the compositor or some other process in the workspace. In Plasma it’s launched by KWin after the compositor has initialized the rendering pipeline. You find the relevant code here.

Although in this case KWin also establishes some communication channels with the newly created Xwayland process, in general the communication between Xwayland and a Wayland server is done via the normal Wayland protocol, in the same way other native Wayland applications talk to the compositor/server. This means the windows requested by possibly several X based applications, and provided by Xwayland acting as an Xserver, are translated by Xwayland to Wayland compatible objects and, with Xwayland acting as a native Wayland client, sent to the Wayland compositor via the Wayland protocol. These windows look to the Wayland compositor just like the windows - in Wayland terminology surfaces - of every other Wayland native application. When reading this keep in mind that an application in Wayland is not limited to using only one window/surface but can create multiple at the same time, so Xwayland as a native Wayland client can do the same for all the windows created for all of its X clients.

In the second part next week we’ll have a close look at the Xwayland code to see how Xwayland fills its role as an Xserver with regard to its X based clients and at the same time acts as a Wayland client when facing the Wayland compositor.

June 20, 2017

Felt it had been too long since I did another Fedora Workstation update. We spend a lot of time trying to figure out how we can best spend our resources to produce the best desktop possible for our users, because even though Red Hat invests more into the Linux desktop than any other company by quite a margin, our resources are still far from limitless. So we make a continuous effort of asking ourselves whether each of the areas we are investing in are the right ones, the ones that give our users the things they need the most. Below is a sampling of the things we are working on.

Improving integration of the NVidia binary driver
This has been ongoing for quite a while, but things have started to land now. Hans de Goede and Simone Caronni have been collaborating, building on the work by NVidia and Adam Jackson around glvnd. So if you set up Simone’s NVidia repository hosted on negativo17 you will be able to install the NVidia driver without any conflicts with the Mesa stack, and thanks to Hans’ work you should be fairly sure that even if the NVidia driver stops working with a given kernel update you will smoothly transition back to the open source Nouveau driver. I’ve been testing it on my own Lenovo P70 system for the last week and it seems to work well under X. That said, once you install the binary NVidia driver that is what you’re running on, which is of course not the behaviour you want from a hybrid graphics system. Fixing that last issue requires further collaboration between us and NVidia.
Related to this, Adam Jackson is currently working on a project he calls glxmux. glxmux will allow you to have more than one GLX implementation on the system, so that you can switch between Mesa GLX for the Intel integrated graphics card and NVidia GLX for the binary driver. While we can make no promises, we hope to have the framework in place for Fedora Workstation 27. Having that in place should allow us to create a solution where you only use the NVidia driver when you want the extra graphics power. That will of course require significant work from NVidia to enable on their side, so I can’t give a definite timeline for when all the puzzle pieces are in place. Just be assured we are working on it and talking regularly to NVidia about it. I will let you know here as soon as things come together.

On the Wayland side, Jonas Ådahl is working on putting the final touches on hybrid graphics support.

Fleet Commander ready for take-off
Another major project we’ve been working on for a long time is Fleet Commander. Fleet Commander is a tool that allows you to manage Fedora and RHEL desktops centrally. This is a tool targeted at, for instance, universities or companies with tens, hundreds or thousands of workstation installations. It gives you a graphical, browser-based UI (accessible through Cockpit) to create configuration profiles and deploy them across your organization. Currently it allows you to control anything that has a GSettings key associated with it, like enabling/disabling extensions and changing configuration settings in GTK+ and GNOME applications. It allows you to configure Network Manager settings, so if you’re updating the company VPN or proxy settings you can easily push those changes out to all users in the organization. Or quickly migrate Evolution email settings to a new email server. The tool also allows you to control recommended applications in the Software Center and set bookmarks in Firefox. There is also support for controlling settings inside LibreOffice.

All these features can be set and controlled on either a user level, a group level or organization wide, due to the close integration we have with the FreeIPA suite of tools. The data is stored inside your organization’s LDAP server alongside other user information, so you don’t need to have the clients connect to a new service for this, and while it is not there in this initial release we will in the future also support Active Directory.

The initial release and Fleet Commander website will be out alongside Fedora Workstation 26.

PipeWire
I talked about PipeWire before, when it was still called Pinos, but the scope and ambition of the project have changed significantly since then. Last time I spoke about it the goal was just to create something that could be considered a video equivalent of PulseAudio. Wim Taymans, who you might know co-created GStreamer and has been a major PulseAudio contributor, has since expanded the scope, and PipeWire now aims at unifying Linux audio and video. The long-term goal is for PipeWire to not only provide handling of video streams, but also handle all kinds of audio. Due to this Wim has been spending a lot of time making sure PipeWire can handle audio in a way that addresses not only the PulseAudio use cases, but also the ones handled by JACK today. A big part of the motivation for this is that we want to make Fedora Workstation the best place to create content and we want the pro-audio crowd to be first class citizens of our desktop.

At the same time we don’t want to make this another painful subsystem transition, so we will need to ensure that PulseAudio applications can still be run without modification.

We expect to start shipping PipeWire with Fedora Workstation 27, but at that point only have it handle video. We need this both to enable good video handling for Flatpak applications through a video portal, and to provide an API for applications that want to do screen capture under Wayland, like web browsers offering screen sharing. We will then bring the audio features onboard in subsequent releases as we also try to work with the JACK and PulseAudio communities to make this a joint effort. We are also working now on a proper website for PipeWire.

Red Hat developer integration
A feature we are quite excited about is the integration of support for the Red Hat developer account system into Fedora. This means that you should be able to create a Red Hat developer account through GNOME Online Accounts, and once you have that account set up you should be able to easily create Red Hat Enterprise Linux virtual machines or containers on your Fedora system. This is a crucial piece for the developer focus that we want the workstation to have, and one that we think will make a lot of developers’ lives easier. We were originally hoping to have this ready for Fedora Workstation 26, but at the moment it looks more likely to hit Fedora Workstation 27. We will keep you up to date as this progresses.

Fractional scaling for HiDPI systems
Fedora Workstation has been leading the charge in supporting HiDPI on Linux and we hope to build on that with the current work to enable fractional scaling support. Since we introduced HiDPI support we have been improving it step by step, for instance last year we introduced support for dealing with different DPI levels per monitor for Wayland applications. The fractional scaling work will take this a step further. The biggest problem it will resolve is that for certain monitor sizes the current scaling options leave things either too small or too big. With fractional scaling support we will introduce intermediate steps, so that you can scale your interface by 1.5x instead of having to go all the way to 2x. The set of technologies we are developing for handling fractional scaling should also allow us to provide better scaling for XWayland applications, as it provides us with methods for scaling that don’t need direct support from the windowing system or toolkit.

GNOME Shell performance
Carlos Garnacho has been doing some great work recently improving the general performance of GNOME Shell. This comes on top of his earlier performance work that was very well received. How fast or slow GNOME Shell feels is often a subjective thing, but reducing overhead where we can is never a bad thing.

Flatpak building
Owen Taylor has been working hard on putting the pieces in place to start large-scale Flatpak building in Fedora. You might see a couple of test Flatpaks appear in the Fedora Workstation 26 timeframe, but the goal is to have a huge Flatpak catalog ready in time for Fedora Workstation 27. Essentially what we are doing is making it very simple for a Fedora maintainer to build a Flatpak of the application they maintain through the Fedora package building infrastructure and push that Flatpak into a central Flatpak registry. And while this is of course mainly meant to benefit Fedora users, there is nothing stopping other distributions from offering these Flatpak-packaged applications to their users as well.

Atomic Workstation
Another effort that is marching forward is what we call Atomic Workstation. The idea here is to have an immutable OS image, kind of like what you see on for instance Android devices. The advantage of this is that the core of the operating system gets tested and deployed as a unit, and the chance of users ending up with broken systems decreases significantly, as we don’t need to rely on packages getting applied in the correct order or scripts executing as expected on each individual workstation out there. This effort is largely based on the Project Atomic effort, and the end goal here is to have an image-based OS install and Flatpak-based applications on top of it. If you are very adventurous and/or want to help out with this effort you can get the ISO image installer for Atomic Workstation here.

Firmware handling
Our Linux Firmware project is still going strong, with new features being added and new vendors signing on. As Richard Hughes recently blogged, the latest vendor to join the effort is Logitech, who will now upload their firmware into the service so that you can keep your Logitech peripherals updated through it. It is worthwhile pointing out here how we worked with Logitech to make this happen, with Richard working on the special tooling needed and thus reducing the threshold for Logitech to start offering their firmware through the service. We have other vendors we are having similar discussions and collaborations with, so expect to see more. At this point I tend to recommend people get a Dell to run Linux, due to their strong support for efforts such as the Linux Firmware Service, but other major vendors are in the final stages of testing, so expect more major vendors to start pushing firmware updates soon.

High Dynamic Range
The next big thing in display technology is HDR (High Dynamic Range). HDR allows for deeper, more vibrant colours and is a feature seen on a lot of new TVs these days; game consoles like the PlayStation 4 support it too. Computer monitors with this feature are now appearing on the market as well, for instance the Dell UP2718Q. We want to ensure Fedora and Linux are leaders here, for the benefit of video and graphics artists using Fedora and Red Hat Enterprise Linux. We are thus kicking off an effort to make sure this technology matures as quickly as possible and is fully supported. We are not the only ones interested in this, so we will hopefully be collaborating with our friends at Intel, AMD and NVidia on it. We hope to have the first monitors delivered to our office within a few weeks.

Codecs
While playback these days has moved to streaming, where locally installed codecs are of less importance for the consumption use case, having a wide selection of codecs available is still important for media editing and creation use cases; we want you to be able to load a variety of old media files into your video editor, for instance. Luckily we are at a crossroads now where a lot of widely used codecs have their essential patents expire (mp3, AC3 and more), while at the same time the industry focus seems to have moved to royalty-free codec development (Opus, VP9, Alliance for Open Media). We have been spending a lot of time with the Red Hat legal team trying to clear these codecs, which resulted in mp3 and AC3 now shipping in Fedora Workstation. We have more codecs on the way though, so this effort is in no way over. My goal is that over the course of this year the situation of software patents being a huge issue when dealing with audio and video codecs on Linux will come to be considered a thing of the past. I would like to thank the Red Hat legal team for their support on this issue, as they have had to spend significant time on it; a big company like Red Hat does need to do its own due diligence when it comes to these things, we can’t just trust statements from random people on the internet that these codecs are now free to ship.

Battery life
We’ve been looking at this for a while now and hope to be able to start sharing information with users on which laptops they should get to have good battery life under Fedora. Christian Kellner is now our point man on battery life and he has taken up improving the Battery Bench tool that Owen Taylor wrote some time ago.

QtGNOME platform
We will have a new version of the QtGNOME platform in Fedora 26. For those of you who have not yet heard of this effort, it is a set of themes and tools to ensure that Qt applications run without any major issues under GNOME 3. With the new version the theming expands to include the accessibility and dark themes in Adwaita, meaning that if you switch to one of these themes under GNOME Shell it will also switch your Qt applications over. We are also making sure things like cut’n’paste and drag and drop work well. The version in Fedora Workstation 26 is a big step forward for this effort and should hopefully make Qt applications first class citizens under your Fedora Workstation desktop.

Wayland polish
Ever since we switched the default to Wayland we have kept the pressure up and kept fixing bugs and finding solutions for corner cases. The result should be an improved Wayland experience in Fedora Workstation 26. A big thanks to Olivier Fourdan, Jonas Ådahl and the whole Wayland community for their continued efforts here. Jonas is working on two major items, for instance. One is improving fractional scaling, to ensure that your desktop scales to an optimal size on HiDPI displays of various sizes. What we currently have is limited to 1x or 2x, which is either too small or too big for some screens, but with this work you can also do 1.5x scaling. The other is preparing an API that will allow screen sharing under Wayland, so that for instance sharing your slides over video conferencing can work under Wayland.

June 19, 2017

Introducing casync

In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).

The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that's what makes it useful for a variety of use-cases that other tools can't cover that well.

Why?

I created casync after studying how today's popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as ISO9660) for download on HTTP shares (in the better cases combined with zsync data).

Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don't think any of the tools mentioned above are really good on more than a small subset of these points.

Specifically: Docker's layered tarball approach dumps the "delta" question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, and likely requires substantially more disk space and download sizes.

OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.

Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. The only point I am trying to make is that for the use case I care about — file system image delivery with high-frequency update cycles — each system comes with certain drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn't happy with the security and reproducibility properties of these systems. In today's world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there's only one valid representation of the data to deliver, that can easily be verified.

What casync Is

So much about the background why I created casync. Now, let's have a look what casync actually is like, and what it does. Here's the brief technical overview:

Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used as a key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.

Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.

As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.

Finally, let's put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.

The "chunking" algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.

Here's a diagram that hopefully explains a bit how the encoding process works, despite my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)

Details

Note that casync operates on two different layers, depending on the use-case of the user:

  1. You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.

Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as a direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx files are identical to .caidx files; the only difference is semantic: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as a preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could typically be used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /srv/container-v3 is very similar but operates on the file system layer, and uses two old container versions to seed the new version.

  2. When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.

  3. casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don't need FAT file bits (and I figure it's pretty likely you don't), then they won't be stored.

  4. The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without= to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that's the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the "nodump" (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container images for use with Linux user namespacing.

  11. In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.

  12. When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to "dm-verity", as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel's block layer. This feature is implemented through Linux' NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented through Linux' FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization they were mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that's hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there's little point in seeding the white-space after the file system within the partition.

Example Command Lines

Here's how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data only. Now let's make this more interesting, and reference remote resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then place the .caidx file and the chunks on it. Note that this mode of operation is "smart": this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.

Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.

Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let's create a chunk index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.

Now let's make a .caidx file available locally as a mounted file system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similar, let's make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device node path to STDOUT.

As mentioned, casync is big about reproducibility. Let's make use of that to calculate a digest identifying a very specific version of a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with=unix selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)'s immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.

What casync isn't

  1. casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google's Courgette.

  2. casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I'd like to make it useful for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.

  3. In the longer run, I'd like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync's functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME's gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync's default HTTP/HTTPS back-ends built on CURL with GNOME's own HTTP implementation, in order to share cookies, certificates, … There's also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.

  5. I intend to add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.

Further plans are listed tersely in the TODO file.

FAQ:

  1. Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and that's what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn't really make sense on non-Linux systems), as long as they don't interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.

  3. Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn't. The reflink magic in casync is employed when the file system permits it, and it's good to have it, but it's not a requirement, and casync will implicitly fall back to copying when it isn't available. Note that casync supports a number of file system features on a variety of file systems that aren't available everywhere, for example FAT's system/hidden file flags or xfs's projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial release. While I have been working on it for quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don't see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and documented? — casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It's definitely my intention to add comprehensive docs for both formats however. Don't forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn't "just like" some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.

  7. Why did you invent your own serialization format for file trees? Why don't you just use tar? — That's a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn't enforce reproducibility. It also doesn't really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the moment not. That's not because I wouldn't want it to, but simply because I am not a guru of either of these systems, and didn't want to implement something I do not fully grok nor can test. If you look at the sources you'll find that there's already some definitions in place that keep room for them though. I'd be delighted to accept a patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking work on compressed serializations? – That's a very good point! Usually, if you apply a chunking algorithm to a compressed data stream (let's say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn't apply that strictly to squashfs images, as squashfs provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This is beneficial for systems employing chunking, such as casync, as it means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order to achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunk sizes of casync should be matched up (using casync's --chunk-size= option). How precisely to choose both values is left as a research subject for the user for now; an example of what this could look like follows right after this FAQ.

  10. What does the name casync mean? – It's a synchronizing tool, hence the -sync suffix, following rsync's naming. It makes use of the content-addressable concept of git hence the ca- prefix.

  11. Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.
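
As promised in FAQ item 9 above, here is a rough illustration of matching the squashfs block size and the casync chunk size. The mksquashfs invocation and the 128K value are just an example for illustration, not a recommendation; the only casync-specific part is the --chunk-size= option mentioned above:

$ mksquashfs rootfs/ image.squashfs -b 131072
$ casync make --chunk-size=131072 image.caibx image.squashfs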

Should you care? Is this a tool for you?

Well, that's up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.

Note that casync is an Open Source project: if it doesn't do exactly what you need, prepare a patch that adds what you need, and we'll consider it.

If you are interested in the project and would like to talk about this in person, I'll be presenting casync soon at Kinvolk's Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.

The All Systems Go! 2017 Call for Participation is Now Open!

We’d like to invite presentation proposals for All Systems Go! 2017!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

We are now accepting submissions for presentation proposals. In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

Please submit your proposals by September 3rd. Notification of acceptance will be sent out 1-2 weeks later.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year; All Systems Go! is taking its place. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!

I just released the first release candidate for libinput 1.8. Aside from the build system switch to meson one of the more visible things is that the helper tools have switched from a "libinput-some-tool" to the "libinput some-tool" approach (note the space). This is similar to what git does so it won't take a lot of adjustment for most developers. The actual tools are now hiding in /usr/libexec/libinput. This gives us a lot more flexibility in writing testing and debugging tools and shipping them to the users without cluttering up the default PATH.

There are two potential breakages here. One is that the two existing tools libinput-debug-events and libinput-list-devices have been renamed too. We currently ship compatibility wrappers for those but expect those wrappers to go away in future releases. The second breakage is of lesser impact: typing "man libinput" used to bring up the man page for the xf86-input-libinput driver. Now it brings up the man page for the libinput tool, which then points to the man pages of the various features. That's probably a good thing, it puts the documentation a bit closer to the user. For the driver, you now have to type "man 4 libinput" though.
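For reference, the invocations described above look like this in practice (the old tool names are currently compatibility wrappers and may disappear in future releases):

$ libinput list-devices      # new invocation style
$ libinput-list-devices      # old name, still works via the compatibility wrapper for now
$ man libinput               # man page of the libinput tool
$ man 4 libinput             # man page of the xf86-input-libinput driver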

Last week I got review on the new interface for tiling for vc4, and landed it in the kernel. I’ve submitted Mesa patches to the list, landed the kernel backport in Raspbian, and have sent Simon (the package maintainer at the Foundation) the Mesa patchset to backport all of my work from the last 6 months (unfortunately Raspbian generally doesn’t update Mesa past the Debian stable release, so graphics work for them has long delays). I also worked with Dom at the Foundation to get an interface for the firmwarekms mode to do tiled scanout, so it will be as fast as the open source driver.

I also reworked the official 7” touchscreen panel driver again. I had initially written it as a DSI panel driver, then got told it should be a bridge driver plus a panel driver, then the bridge maintainers told me it should be only one panel driver but attach to I2C. Maybe this version will stick, who knows.

My Mesa side for pl111 is now in master, so I’ve got the second platform for the vc4 driver basically done. I can’t wait to do another one.

I got a question about whether VC4 could do EGL_ANDROID_native_fence, to allow using Android’s hwcomposer (the display server that puts surfaces in KMS planes if it can). I spent a while looking into it, and found that while code had landed in Mesa for other drivers, the piglit tests had been left around untouched for the last half a year. I reviewed them and hopefully they can land soon. Unfortunately, they don’t cover the important functionality of the extension, so I can’t use them to actually write the driver code (which doesn’t look very hard) until I build some more tests.

On the non-vc4 kernel side, I prepared and submitted pull requests for 4.13. We’ve got USB OTG support for the Pi Zero, SDHOST enabled by default finally, thermal enabled by default finally, and the RPi3 DT being installed for 32-bit kernels. Three of these took months longer than they should have, because the Linux kernel has the worst software development process I’ve worked with since XFree86.

June 16, 2017
This is going to be a long one, so grab a drink and a snack and buckle up! Incidentally, the X.Org foundation has asked us, the 2017 GSoC students, to blog weekly, so from now on I will do so, which will also mean smaller blogs in the future. I made a schedule to go with my proposal in which I divided the coding period into two-week sprints to plan out my project.

There is a saying that persistence is the key to success. Not that I’m always following this advice, but I did luckily earlier this year when I had to decide if I wanted to apply again for a Google Summer of Code (GSoC) spot after my application in the last year was rejected by my organisation of choice back then. I talked about it in one of my last posts. Anyway, thanks to being persistent this year one of my project ideas got accepted and I now have the opportunity to work on a very interesting project concerning Xwayland, this time for the X.Org Foundation.

The project is kind of difficult though. At least it feels that way to me right now, after I've already spent quite some time digging into it and getting a feel for it.

The main reason for the difficulty is that various components of the XServer are in play and documentation for them is often missing. If I were on my own, I could basically only work with the code and try to comprehend the steps one after the other. I'm not on my own though! With Daniel Stone I have one of the core developers of several key parts of the Linux graphics stack as my mentor, who really seems to want to help me understand the difficult stuff on a large scale while still giving concrete advice on what to do next. He even drew a diagram for me!

In future blog posts I plan on publishing resources like this otherwise clearly underused diagram, since material that helps in understanding the XServer code is otherwise hard to find on the web. Additionally I'll try to explain the idea of my project and the general structure of the code I'm dealing with in my own words. Regarding these future posts, a nasty surprise this week was that I'm expected to publish one blog post per week about my project. This somewhat spoils my plans for this blog, where I wanted to publish few but comparatively more sophisticated posts. You can therefore expect frequent weekly posts until August. Next week I want to explain the basic idea of my project.

June 15, 2017
For the last couple of months I've been working on improving Linux support for Intel Bay and Cherry Trail based tablets and laptops as a spare-time project.

I'm happy to report that quite a bit of work on this has landed for 4.12, specifically the following fixes / improvements have landed:

  • LCD panel not working with the i915 driver on some devices

  • Battery monitoring not working on most Bay and Cherry devices (battery monitoring did not work on any device using the very popular AXP288 PMIC)

  • Backlight control not working on many devices (you need to build the pwm-lpss drivers in, or add them to your initrd, for this fix to work)

  • Volume and power buttons not working on Cherry Trail devices shipping with Win10

  • The microsd slot on the Asus Transformer T100TA not working

  • Preliminary (incomplete) support for Cherry Trail devices with a Whiskey Cove PMIC; some bits to get e.g. battery monitoring working are still missing

  • Added the rtl8723bs driver to staging, enabling out-of-the-box support for wifi on many devices

I've also prepared a patch for the Fedora kernel config to enable the necessary new drivers so this should all also work with the upcoming 4.12
kernel for Fedora.
June 09, 2017
I'm working on making Fedora 27 a fully integrated VirtualBox guest. The plan is to have Fedora 27 Workstation ship with the VirtualBox Guest Additions installed out of the box, so that cut and paste, file sharing, etc. all will work out of the box for users running Fedora 27 in a VirtualBox vm.

As a first step towards this I'm working on getting the VirtualBox guest kernel drivers merged into the mainline Linux kernel. Up until now this has never been done because of userspace ABI stability concerns surrounding the guest drivers.

I've been talking to VirtualBox upstream about mainlining the guest drivers and VirtualBox upstream has agreed to consider the userspace ABI stable and only extend it in a backwards compatible manner.

I'm about to finish cleaning up the vboxvideo drm/kms driver. When I started working on this, the files under /usr/src/vboxguest/vboxvideo as installed by the VirtualBox 5.1.18 Guest Additions had a total line count of 52681 lines. I've been submitting cleanups to VirtualBox upstream, and the current VirtualBox upstream svn code is now under 8000 lines.

Update: I've submitted the vboxvideo driver for inclusion into drivers/staging, see this mailinglist thread.
June 03, 2017
This will be a quick blog post on the previous two weeks. During this time, the community bonding period ended and the coding phase officially began! I spent most of the first week working on university projects, so that those are as done as they can be at this point and I have little else to work on besides Piper. In the second week, I thought I would get a few days' head start on my schedule by starting early on the mockups.
June 02, 2017

Back in 2003, the pressure range value for Wacom pen tablets was set to 2048. That's a #define in the driver, but it shouldn't matter because we also advertise this range as part of the device description in the X Input protocol. Clients should be using that advertised min/max range and scale appropriately, in the same way as they should be doing this for the x/y axis ranges.

Fast-forward to 2017 and we changed the pressure range. New Wacom devices now use ~8000 levels, but we opted to just #define it in the driver to 65536 and be done with it. We now scale all models into that range, with varying granularity based on the physical hardware. It shouldn't matter because it's not tied to a reliable physical property anyway and the only thing that matters is the percentage of the max value (which is why libinput just gives you a [0, 1] range; hindsight is bliss).
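The scaling point is worth spelling out: a well-behaved client normalises against whatever range the device advertises instead of assuming 2048 or 65536. A small, hedged illustration of that idea in Go follows - it is not code from the wacom driver or from libinput, just the arithmetic:

package main

import "fmt"

// normalizePressure maps a raw pressure value into [0, 1] using the
// advertised device range, which is what clients should do instead of
// hard-coding a maximum.
func normalizePressure(raw, min, max float64) float64 {
    if max <= min {
        return 0
    }
    n := (raw - min) / (max - min)
    if n < 0 {
        return 0
    }
    if n > 1 {
        return 1
    }
    return n
}

func main() {
    // Half pressure on an old 2048-level device and on a 65536-level
    // device normalises to the same value.
    fmt.Println(normalizePressure(1024, 0, 2048))   // 0.5
    fmt.Println(normalizePressure(32768, 0, 65536)) // 0.5
}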

Slow-forward to 2017-and-a-bit and we received complaints that pressure handling is now broken. Turns out that some applications hardcoded the old 2048 range and now get confused, because virtually any pressure will now hit that maximum. Since those applications are largely proprietary ones and cannot be updated easily, we needed a workaround for this. Jason Gerecke from Wacom got busy and we now have a "Pressure2K" option available in the driver. If set, this option will scale everything into the 2048 range to make sure those applications still work. To get this to work, the following xorg.conf.d snippet is recommended:


Section "InputClass"
Identifier "Wacom pressure compatibility"
MatchDriver "wacom"
Option "Pressure2K" "true"
EndSection
Put it in a file that sorts higher than the wacom driver itself (e.g. /etc/X11/xorg.conf.d/99-wacom-pressure2k.conf) and restart X. Check the Xorg.log/journal for a "Using 2K pressure levels" message, then verify it works by running xinput list "device name". xinput should show a range of 0 to 2048 on the third valuator.

$> xinput list "Wacom Intuos4 6x9 Pen stylus"
Wacom Intuos4 6x9 Pen stylus id=25 [slave pointer (2)]
Reporting 8 classes:
Class originated from: 25. Type: XIButtonClass
Buttons supported: 9
Button labels: None None None None None None None None None
Button state:
Class originated from: 25. Type: XIKeyClass
Keycodes supported: 248
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 0:
Label: Abs X
Range: 0.000000 - 44704.000000
Resolution: 200000 units/m
Mode: absolute
Current value: 22340.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 1:
Label: Abs Y
Range: 0.000000 - 27940.000000
Resolution: 200000 units/m
Mode: absolute
Current value: 13970.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 2:
Label: Abs Pressure
Range: 0.000000 - 2048.000000
Resolution: 1 units/m
Mode: absolute
Current value: 0.000000

Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 3:
Label: Abs Tilt X
Range: -64.000000 - 63.000000
Resolution: 57 units/m
Mode: absolute
Current value: 0.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 4:
Label: Abs Tilt Y
Range: -64.000000 - 63.000000
Resolution: 57 units/m
Mode: absolute
Current value: 0.000000
Class originated from: 25. Type: XIValuatorClass
Detail for Valuator 5:
Label: Abs Wheel
Range: -900.000000 - 899.000000
Resolution: 1 units/m
Mode: absolute
Current value: 0.000000

This is an application bug, but this workaround will make sure new versions of the driver can be used until those applications have been fixed. The option will be available in the soon-to-be-released xf86-input-wacom 0.35. The upstream commit is d958ab79d21b57141415650daac88f9369a1c861.

Edit 02/06/2017: wacom devices have 8k pressure levels now.

It’s been a busy month. Part of that is that I took a week off to go to our Burning Man regional event, which threw a serious wrench into my status updates.

Hans Verkuil now has HDMI CEC working, and is polishing the patches up for inclusion in the kernel. Unfortunately we set his work back a bit with Boris's new HDMI power management code, and that's getting resolved now. He also tripped over a bug in my new dma-buf reservation code that I helped track down.

Boris has the transposer working and has reviewed the core writeback code, so hopefully we can land it soon. I’m really excited to use this for testing tiled display support (see below).

My work for the last couple of weeks has been in figuring out what to do about the fact that the new Raspbian display stack (vc4 firmwarekms, glamor, compton compositor) takes way more memory than the old display stack (fbdev on the firmware, no glamor, no compositing). It's been pretty easy for users to hit the 256MB CMA limit, and we can't even use a 256MB CMA area on the Pi0/1, since those only have 512MB of system memory.

One of the ideas would be to page GPU allocations out to system memory. That only gets us from 256MB to <1GB of memory available. It’s not like you want to be swapping to SD cards. If we’re hitting 256MB this easily, we need to figure out how to fix that.

The first step was to figure out where our memory was going. I wrote a hack that had the kernel give names to the BOs it allocated, and added a new UAPI that let userspace attach more descriptive names to BOs. With that, I added a little hack to Mesa to attach some descriptions. I then made /debug/dri/0/bo_stats print out the stats on count/size per name, rather than just total. Here’s the output of LXDE with compton as I was dragging a terminal around:

 scanout resource 1920x1080@32/0:  48600kb BOs (6)
    tiling shadow 1920x1080:  32640kb BOs (4)
		   overflow:  16384kb BOs (1)
    resource 1920x1080@32/0:  16320kb BOs (2)
    resource 1024x1024@32/0:   4096kb BOs (1)
      tiling shadow 661x411:   2184kb BOs (2)
      resource 657x386@32/0:   1092kb BOs (1)
  scanout resource 661x411@32/0:   1068kb BOs (1)
     resource 1048576x1@8/0:   1024kb BOs (1)
     resource 1024x1024@8/0:   1024kb BOs (1)
		   resource:    888kb BOs (20)
      tiling shadow 1920x26:    240kb BOs (1)
      resource 1920x26@32/0:    240kb BOs (1)
		     shader:    236kb BOs (56)
  scanout resource 1920x26@32/0:    196kb BOs (1)
		 mesa cache:    152kb BOs (5)
       resource 659x20@32/0:     84kb BOs (1)
       resource 65536x1@8/0:     64kb BOs (1)
      resource 128x128@32/0:     64kb BOs (1)
	  resource 1x1@32/0:     56kb BOs (14)
	      dumb 64x64@32:     48kb BOs (3)
	resource 659x1@32/0:     36kb BOs (3)
		      cache:     28kb BOs (4)
			RCL:     20kb BOs (2)
	 resource 26x1@32/0:     16kb BOs (4)
	resource 1x386@32/0:     16kb BOs (2)
	resource 1x361@32/0:     16kb BOs (2)
	 resource 1x25@32/0:     16kb BOs (4)
	resource 609x1@32/0:     12kb BOs (1)
	resource 607x1@32/0:     12kb BOs (1)
	resource 12x18@32/0:      4kb BOs (1)
			BCL:      4kb BOs (1)

This is horrifying. First, this is already after a fix I've landed that kept us from leaking an extra overflow BO, which cost 16MB at all times. Next, 6 full-screen scanout resources? I expected 3, certainly, due to compton's pageflipping (though for a compositor, keeping 3 is bad: I should have GPU idle time built into my pipeline anyway, as presumably GPU work is being done by other clients to generate the frames I'm compositing). Our non-pageflipping scanout buffer is probably another one (can we free the screen pixmap's contents during pageflipping? That would sure be nice). Maybe the file manager is drawing into a window rather than the root window, to make another one. There's still one more I haven't accounted for, though.

More embarrassing are the tiling shadow allocations. We do this in vc4 because it can't texture from linear BOs, so we allocate a shadow resource that we blit a tiled copy into. The cost of those tile blits is actually why we're running a compositor in Raspbian in the first place, because window movement is too expensive otherwise.

My crazy idea was to make a new mode for glamor that just keeps all pixmaps in system memory until they need to be on the GPU (because they’re either scanned out by the GPU or they have an exposed reference to them for DRI). My previous reworks for linear vs tiled buffers in glamor already made this look easy. You end up with software fallbacks all the time (The old fbturbo driver was all software fallbacks, though), and with the new gbm_bo_map() interface we should be able to keep the cost of software fallbacks down to just the cost of the rendering performed, not a 1920x1080 readback per drawing operation.

However, 20 patches in and without having gotten to the point of gbm_bo_map() yet, this is looking like a mess. On Friday I had breakfast with keithp and talked over what I was doing, and he convinced me to go back to an older, simpler plan: Just make all of our pixmaps tiled, so we don’t have the shadow copies at all. If we don’t have the shadow copies, then we don’t need the compositor to get window-dragging performance. At least 64MB of the allocations above would go away.

The new plan means we eat the memory bandwidth cost of scanning out from tiled, but it will be easy enough to register a timer to go make a linear copy and flip to that. We’re not the only platform that would like that, I’m sure.

I’ve now written the kernel interface and debugging it is the plan for this week.

In other news, my panel-bridge v3 is now in drm-misc-next, along with most of the associated code deletion. I also helped the team trying to integrate my PL111 code with a bug in panning support, which their current 3D stack relies on. In the process of debugging the pl111+vc4 code (a full screen X11 fallback would reallocate the BO when it shouldn’t), I added debugfs register dumping for pl111, which should be useful for other developers using the code.

I have the X Server building in Travis CI, at long last. The meson build system was critical for this, as autotools was just too slow. The other key part was switching to Docker for bundling our dependencies, since 20 autotools invocations was unreasonable, and I won’t have the time to convert those 20 things to meson. Next steps on CI are to get the X Test suite running in the meson build system (the docker image already has it present).

Finally, earlier this month I submitted my Mesa register allocation series (let the driver choose the color during graph coloring), which should be good for about a 2% performance improvement to vc4 3D.

June 01, 2017

With modifier support added to Mesa and gbm_gralloc, it is now possible to boot Android on iMX6 platforms using no proprietary blobs at all. This makes iMX6 one of the very few embedded SOCs that needs no blobs at all to run a full graphics stack.

Not only is that a great win for Open Source in general, but it also makes the iMX6 more attractive as a platform. A further positive point is that this lays the groundwork for the iMX8 platform, and supporting it will come much easier.

Currently the modifiers work is in the process of being upstreamed, but in the meantime it can be found here. If you'd like to test this out yourself I maintain a how-to.

What are modifiers used for?

Modifiers are used to represent different properties of buffers. These properties can cover a range of different information about a buffer, for example compression and tiling.

For the case of …

May 30, 2017

DRM leasing part three (Vulkan)

With the kernel APIs off for review, and the X RandR bits looking like they're in reasonable shape, I finally found some time to sit down and figure out how I wanted to integrate this into Vulkan.

Avoiding two DRM file descriptors

Given that a DRM lease is represented by a DRM master file descriptor, we want to use that for all of the operations in the driver, including rendering and mode setting. Using the vulkan driver render node and the lease master node together would require passing buffer objects between the kernel contexts using even more file descriptors.

The Mesa Vulkan drivers open the device nodes while enumerating devices, not when they are created. This seems a bit early to me, but it makes sure that the devices being enumerated are actually available for use, and not just present in the system. To replace the render node fd with the lease master fd means hooking into the system early enough that the enumeration code can see the lease fd. And that means creating an instance extension as the instance gets created before devices are enumerated.

The VK_KEITHP_kms_display instance extension

This simple instance extension provides the necessary hooks to get the lease information from the application down into the driver before the DRM node is opened. In the first implementation, I added a function that could be called before the devices were enumerated to save the information in the Vulkan loader. That worked, but required quite a bit of study of the Vulkan loader and its XML description of the full Vulkan API.

Mark Young suggested that a simpler plan would be to chain the information into the VkInstanceCreateInfo pNext field; with no new APIs added to Vulkan, there shouldn't be any need to change the Vulkan loader -- the device driver would advertise the new instance extension and the application could find it.

That would have worked great, except the Vulkan loader 'helpfully' elides all instance extensions it doesn't know about before returning the list to the application. I'd say this was a bug and should be fixed, but for now, I've gone ahead and added the few necessary definitions to the loader to make it work.

In the application, it's a simple matter of searching for this extension, constructing the VkKmsDisplayInfoKEITHP structure, chaining that into the VkInstanceCreateInfo pNext list and passing that in to the vkCreateInstance call.

typedef struct VkKmsDisplayInfoKEITHP {
    VkStructureType         sType;  /* VK_STRUCTURE_TYPE_KMS_DISPLAY_INFO_KEITHP */
    const void*             pNext;
    int                     fd;
    uint32_t                crtc_id;
    uint32_t                *connector_ids;
    int                     connector_count;
    drmModeModeInfoPtr      mode;
} VkKmsDisplayInfoKEITHP;

As you can see, this includes the master file descriptor along with all of the information necessary to set the desired video mode using the specified resources.

The driver just walks the pNext list from the VkInstanceCreateInfo structure looking for any provided VkKmsDisplayInfoKEITHP structure and pulls the data out.

To avoid questions about file descriptor lifetimes, the driver dup's the provided fd. The application is expected to close their copy at a suitable time.

The VK_KHR_display extension

Vulkan already has an API for directly accessing the raw device, including code for exposing video modes and everything. As tempting as it may be to just go do something simpler, there's a lot to be said for using existing APIs.

This extension doesn't provide any direct APIs for acquiring display resources, relying on the VK_EXT_acquire_xlib_display extension for that part. And that takes a VkPhysicalDisplay parameter, which is only available after the device is opened, which is why I created the VK_KEITHP_kms_display extension instead of just using the VK_EXT_acquire_xlib_display extension -- we can't increase the capabilities of the render node opened by the driver, and we don't want to keep two file descriptors around.

With the information provided by the VK_KEITHP_kms_display extension, we can implement all of the VK_KHR_display extension APIs, including enumerating planes and modes and creating the necessary display surface. Of course, there's only one plane and one mode, so some of the implementation is pretty simplistic.

The big piece of work was to create the swap chain structure and associated frame buffers.

A working example

I've taken the 'cube' example from the Vulkan loader and hacked it up to use XCB to construct a DRM lease, and the VK_KEITHP_kms_display extension to pass that lease into the Vulkan driver. The existing support for the VK_KHR_display extension "just worked", which was pretty satisfying.

It's a bit of a mess

I'm not satisfied with the mesa code at this point; there's a bunch of code in the radeon driver which should be in the vulkan WSI bits, and the vulkan WSI bits should probably not have the KMS interfaces wired in. I'll ask around and see what other Mesa developers think I should do before restructuring it further; I'll probably have to rewrite it all at least one more time before it's ready to upstream.

Seeing the code

I'll be cleaning the code up a bit before sending it out for review, but it's already visible in my own repositories:

May 29, 2017

tl;dr We can create coverage-instrumented binaries, run them and aggregate the coverage data from running both the program and the unit tests.

In the Go world, unit testing is tightly integrated with the go tool chain. Write some unit tests, run go test and tell anyone that will listen that you really hope to never have to deal with a build system for the rest of your life.

Since Go 1.2 (Dec. 2013), go test has supported test coverage analysis: with the ‑cover option it will tell you how much of the code is being exercised by the unit tests.

So far, so good.

I've been wanting to do something slightly different for some time though. Imagine you have a command line tool. I'd like to be able to run that tool with different options and inputs, check that everything is OK (using something like bats) and gather coverage data from those runs. Even better, wouldn't it be neat to merge the coverage from the unit tests with the coverage from those program runs and have an aggregated view of the code paths exercised by both kinds of testing?

A word about coverage in Go

Coverage instrumentation in Go is done by rewriting the source of an application. The cover tool inserts code to increment a counter at the start of each basic block, a different counter for each basic block of course. Some metadata is kept alongside each of the counters: the location of the basic block (source file, start/end line & columns) and the size of the basic block (number of statements).

This rewriting is done automatically by go test when coverage information has been requested by the user (go test -x to see what's happening under the hood). go test then generates an instrumented test binary and runs it.
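To make the rewriting concrete, here is an illustrative before/after sketch. The real cover tool generates different names and also records the position and statement-count metadata mentioned above; treat this only as a picture of the idea:

package main

import "fmt"

// abs is the original, uninstrumented function.
func abs(x int) int {
    if x < 0 {
        return -x
    }
    return x
}

// absInstrumented shows roughly what the cover tool produces: a
// package-level counter array and one increment per basic block.
var coverCount [3]uint32

func absInstrumented(x int) int {
    coverCount[0]++ // block: function entry up to the if
    if x < 0 {
        coverCount[1]++ // block: the then-branch
        return -x
    }
    coverCount[2]++ // block: the trailing return
    return x
}

func main() {
    absInstrumented(-4)
    fmt.Println(coverCount) // [1 1 0]
}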

A more detailed explanation of the cover story can be found on the Go blog.

Another interesting thing is that it's possible to ask go test to write out a file containing the coverage information with the ‑coverprofile option. This file starts with the coverage mode, which is how the coverage counters are incremented. This is one of set, count or atomic (see blog post for details). The rest of the file is the list of basic blocks of the program with their metadata, one block per line:

github.com/clearcontainers/runtime/oci.go:241.29,244.9 3 4

This describes one piece of code from oci.go, composed of 3 statements without branches, starting at line 241, column 29 and finishing at line 244, column 9. This block has been reached 4 times during the execution of the test binary.

Generating coverage instrumented programs

Now, what I really want to do is to compile my program with the coverage instrumentation, not just the test binary. I also want to get the coverage data written to disk when the program finishes.

And that's when we have to start being creative.

We're going to use go test to generate that instrumented program. It's possible to define a custom TestMain function, a kind of entry point, for the test package. TestMain is often used to set up the test environment before running the list of unit tests. We can hack it a bit to call our main function and jump to running our normal program instead of the tests! I ended up with something like this:
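(The snippet embedded in the original post isn't reproduced in this archive. As a stand-in, here is a minimal, hedged sketch of the same TestMain trick; the argument heuristic and the fact that TestMain simply calls main() are assumptions made for illustration, not the actual cc-runtime code.)

// main_test.go - minimal sketch of the TestMain hack, placed in the same
// package as the program's main().
package main

import (
    "os"
    "strings"
    "testing"
)

func TestMain(m *testing.M) {
    // Heuristic (an assumption for this sketch): any argument that does
    // not look like a -test.* flag means we were invoked as the program.
    progArgs := []string{os.Args[0]}
    for _, arg := range os.Args[1:] {
        if !strings.HasPrefix(arg, "-test.") {
            progArgs = append(progArgs, arg)
        }
    }

    if len(progArgs) == 1 {
        // Plain "go test" run: behave like a normal test binary.
        os.Exit(m.Run())
    }

    // Program run: hide the -test.* flags from main(), call it, then run
    // the test machinery so the testing package flushes the
    // -test.coverprofile data to disk. This naive sketch also runs the
    // unit tests at this point; the real tool presumably avoids that.
    os.Args = progArgs
    main()
    m.Run()
    os.Exit(0)
}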


The current project I'm working on is called cc-runtime, an OCI runtime spawning virtual machines. It definitely deserves its own blog post, but for now, knowing the binary name is enough. Generating a coverage instrumented cc-runtime binary is just a matter of invoking go test:

$ go test -o cc-runtime -covermode count

I haven't used atomic as this binary is really a thin wrapper around a library and doesn't use many goroutines. I'm also assuming that the cost of using atomic operations in every branch is "quite a bit" higher than the non-atomic addition. I don't care too much if the counter is off by a bit, as long as it's strictly positive.

We can run this binary just as if it were built with go build, except it's really a test binary and we have access to the same command line arguments as we would otherwise. In particular, we can ask to output the coverage profile.

$ ./cc-runtime -test.coverprofile=list.cov list
[ outputs the list of containers ]

And let's have a look at list.cov. Hang on... there's a problem, nothing was generated: we didn't get the usual "coverage: xx.x% of statements" at the end of a go test run and there's no list.cov in the current directory. What's going on?

The testing package flushes the various profiles to disk after running all the tests. The problem is that we don't run any test here, we just call main. Fortunately enough, the API to trigger a test run is semi-public: it's not covered by the go1 API guarantee and has "internal only" warnings. Not. Even. Scared. Hacking up a dummy test suite and running it is easy enough:


There is still one little detail left. We need to call this FlushProfiles function at the end of the program, and that program could very well be using os.Exit anywhere. I couldn't find anything better than a tiny exit package implementing the equivalent of the libc atexit() function, and forbidding direct use of os.Exit in favour of exit.Exit(). It's even testable.
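The exit package itself is tiny. A minimal sketch of the atexit idea follows - the real covertool package may differ in detail:

// Package exit is an atexit-style replacement for os.Exit: the program
// registers cleanup handlers (such as the coverage profile flush) and
// calls exit.Exit() instead of os.Exit() everywhere.
package exit

import "os"

var handlers []func()

// Register adds a function to run before the process terminates.
func Register(f func()) {
    handlers = append(handlers, f)
}

// Exit runs the registered handlers in reverse order, then terminates
// the process with the given status code.
func Exit(code int) {
    for i := len(handlers) - 1; i >= 0; i-- {
        handlers[i]()
    }
    os.Exit(code)
}

The instrumented TestMain can then register the profile flush with exit.Register before handing control to the program.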

Putting everything together

It's now time for a full example. I have a small calc program that can compute additions and subtractions.

$ calc add 4 8
12

The code isn't exactly challenging:
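(The calc source lives in the covertool repository and isn't reproduced here; a minimal version matching the behaviour shown in this article might look like the sketch below - treat it as an illustration, not the author's exact code.)

// calc.go - sketch of the calc example program.
package main

import (
    "fmt"
    "os"
    "strconv"
)

func add(a, b int) int { return a + b }
func sub(a, b int) int { return a - b }

func main() {
    args := os.Args[1:]
    if len(args) != 3 {
        // The real example routes exits through the atexit-style exit
        // package described earlier, so the coverage profile still gets
        // flushed on error paths.
        fmt.Printf("expected 3 arguments, got %d\n", len(args))
        os.Exit(1)
    }

    a, _ := strconv.Atoi(args[1])
    b, _ := strconv.Atoi(args[2])

    switch args[0] {
    case "add":
        fmt.Println(add(a, b))
    case "sub":
        fmt.Println(sub(a, b))
    default:
        fmt.Printf("unknown operation: %s\n", args[0])
        os.Exit(1)
    }
}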


I've written some unit-tests for the add function only. We're going to run calc itself to cover the remaining statements. But first, let's see the unit tests code with both TestAdd and our hacked up TestMain function. I've swept the hacky bits away in a cover package.
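(The original listing isn't reproduced here either. The unit test itself is simple; TestMain follows the same pattern as the sketch shown earlier for cc-runtime, with the hacky bits hidden in the author's cover package, so it isn't repeated in this sketch.)

// calc_test.go - sketch of the unit test for add.
package main

import "testing"

func TestAdd(t *testing.T) {
    if got := add(4, 8); got != 12 {
        t.Fatalf("add(4, 8) = %d, want 12", got)
    }
}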


Let's run the unit-tests, asking to save a unit-tests.cov profile.

$ go test -covermode count -coverprofile unit-tests.cov
PASS
coverage: 7.1% of statements
ok github.com/dlespiau/covertool/examples/calc 0.003s

Huh. 7.1%. Well, we're only testing the 1 statement of the add function after all. It's time for the magic. Let's compile an instrumented calc:

$ go test -o calc -covermode count

And run calc a few times to exercise more code paths. For each run, we'll produce a coverage profile.

$ ./calc -test.coverprofile=sub.cov sub 1 2
-1
$ covertool report sub.cov
coverage: 57.1% of statements

$ ./calc -test.coverprofile=error1.cov foo
expected 3 arguments, got 1
$ covertool report error1.cov
coverage: 21.4% of statements

$ ./calc -test.coverprofile=error2.cov mul 3 4
unknown operation: mul
$ covertool report error2.cov
coverage: 50.0% of statements

We want to aggregate those profiles into one single super-profile. While there are some hints that people are interested in merging profiles from several runs (that commit is in go 1.8), the cover tool doesn't seem to support this kind of thing easily, so I wrote a little utility to do it: covertool.
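For count-mode profiles the merge itself is conceptually simple: identical blocks (same file, position and statement count) have their hit counts summed. Here is a rough sketch of that idea; covertool's real implementation handles the different coverage modes and errors more carefully:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// mergeProfiles sums per-block counts across several count-mode cover
// profiles. A block is identified by everything before the final count
// on its line ("file:start,end numstmt").
func mergeProfiles(paths []string) (map[string]int64, error) {
    counts := make(map[string]int64)
    for _, path := range paths {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            if strings.HasPrefix(line, "mode:") || line == "" {
                continue // assume "mode: count" for every input
            }
            i := strings.LastIndex(line, " ")
            if i < 0 {
                continue
            }
            n, err := strconv.ParseInt(line[i+1:], 10, 64)
            if err != nil {
                return nil, err
            }
            counts[line[:i]] += n
        }
        f.Close()
        if err := scanner.Err(); err != nil {
            return nil, err
        }
    }
    return counts, nil
}

func main() {
    counts, err := mergeProfiles(os.Args[1:])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("mode: count")
    for block, count := range counts {
        fmt.Printf("%s %d\n", block, count)
    }
}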

$ covertool merge -o all.cov unit-tests.cov sub.cov error1.cov error2.cov

Unfortunately again, I discovered a bug in Go's cover and so we need covertool to tell us the coverage of the aggregated profile:

$ covertool report all.cov
coverage: 92.9% of statements

Not Bad!

Still not 100% though. Let's fire up the HTML coverage viewer to see what we are missing:

$ go tool cover -html=all.cov


Oh, indeed, we're missing 1 statement. We never call add from the command line so that switch case is never covered. Good. Seems like everything is working as intended.

Here be dragons

As fun as this is, it definitely feels like very few people are building this kind of instrumented binary. Everything is a bit rough around the edges. I may have missed something obvious, of course, but I'm sure the Internet will tell me if that's the case!

It'd be awesome if we could have something nicely integrated in the future.

May 23, 2017

TLDR: If you see devices like "xwayland-pointer" show up in your xinput list output, then you are running under a Wayland compositor and debugging/configuration with xinput will not work.

For many years, the xinput tool has been a useful tool to debug configuration issues (it's not a configuration UI btw). It works by listing the various devices detected by the X server. So a typical output from xinput list under X could look like this:


:: whot@jelly:~> xinput list
⎡ Virtual core pointer id=2 [master pointer (3)]
⎜ ↳ Virtual core XTEST pointer id=4 [slave pointer (2)]
⎜ ↳ SynPS/2 Synaptics TouchPad id=22 [slave pointer (2)]
⎜ ↳ TPPS/2 IBM TrackPoint id=23 [slave pointer (2)]
⎜ ↳ ELAN Touchscreen id=20 [slave pointer (2)]
⎣ Virtual core keyboard id=3 [master keyboard (2)]
↳ Virtual core XTEST keyboard id=5 [slave keyboard (3)]
↳ Power Button id=6 [slave keyboard (3)]
↳ Video Bus id=7 [slave keyboard (3)]
↳ Lid Switch id=8 [slave keyboard (3)]
↳ Sleep Button id=9 [slave keyboard (3)]
↳ ThinkPad Extra Buttons id=24 [slave keyboard (3)]

Alas, xinput is scheduled to go the way of the dodo. More and more systems are running a Wayland session instead of an X session, and xinput just doesn't work there. Here's an example output from xinput list under a Wayland session:

$ xinput list
⎡ Virtual core pointer id=2 [master pointer (3)]
⎜ ↳ Virtual core XTEST pointer id=4 [slave pointer (2)]
⎜ ↳ xwayland-pointer:13 id=6 [slave pointer (2)]
⎜ ↳ xwayland-relative-pointer:13 id=7 [slave pointer (2)]
⎣ Virtual core keyboard id=3 [master keyboard (2)]
↳ Virtual core XTEST keyboard id=5 [slave keyboard (3)]
↳ xwayland-keyboard:13 id=8 [slave keyboard (3)]

As you can see, none of the physical devices are available; the only ones visible are the virtual devices created by XWayland. On a Wayland session, the X server doesn't have access to the physical devices. Instead, it talks via the Wayland protocol to the compositor. This image from the Wayland documentation shows the architecture:
In the above graphic, devices are known to the Wayland compositor (1), but not to the X server. The Wayland protocol doesn't expose physical devices, it merely provides a 'pointer' device, a 'keyboard' device and, where available, touch and tablet tool/pad devices (2). XWayland wraps these into virtual devices and provides them via the X protocol (3), but they don't represent the physical devices.

This usually doesn't matter, but when it comes to debugging or configuring devices with xinput we run into a few issues. First, configuration via xinput usually means changing driver-specific properties but in the XWayland case there is no driver involved - it's all handled by libinput inside the compositor. Second, debugging via xinput only shows what the wayland protocol sends to XWayland and what XWayland then passes on to the client. For low-level issues with devices, this is all but useless.

The takeaway here is that if you see devices like "xwayland-pointer" show up in your xinput list output, then you are running under a Wayland compositor and debugging with xinput will not work. If you're trying to configure a device, use the compositor's configuration system (e.g. gsettings). If you are debugging a device, use libinput-debug-events. Or compare the behaviour between the Wayland session and the X session to narrow down where the failure point is.

May 20, 2017

Not exclusively, but to a large extent, I worked in the last few months on foundational improvements to KWin's DRM backend, which is a central building block of KWin's Wayland session. The idea in the beginning was to directly expand upon my past Atomic Mode Setting (AMS) work from last year. We're talking about direct scanout of graphic buffers for fullscreen applications and later layered compositing. Indeed this was my Season of KDE project with Martin Flöser as my mentor, but in the end, relative to the initial goal, it was unsuccessful.

The reason for the missed goal wasn't a lack of work or enthusiasm on my side, but the realization that I needed to go back and first rework the foundations, which were in some disarray - mostly because of mistakes I made when I first worked on AMS last year, partly because of changes Daniel Stone made to his work-in-progress patches for AMS support in Weston, which I used as an example throughout my work on AMS in KWin, and also because of some small flaws introduced to our DRM backend before I started working on it.

The result of this rework is three separate patches depending on each other, all of which got merged last week. They will be part of the 5.10 release. The reason for doing three patches instead of only one was to ease the review process.

The first patch dealt with the querying of important kernel display objects which represent real hardware: the CRTCs and Connectors. KWin didn't remember these objects in the past, although they are static while the system is running. This meant, for example, that KWin requeried all of them on a hot plugging event and had no prolonged knowledge about their state after a display was disconnected again. The last point made it particularly difficult to do a proper cleanup of the associated memory after a disconnect. So it made sense to change this so that the kernel objects are queried only once in the beginning. Also, from my past work I had already created a generic class for kernel objects with the necessary subclasses, which could be used in this context. Still, to me this patch was the most "controversial" one of the three, meaning it was the one I was most worried about being somehow "wrong" - not just in details, but in general - especially since it didn't solve any observable specific misbehaviour which it could be benchmarked against. Of course I did my research, but there is always the anxiety of overlooking something crucial. Too bad the other patches depended on it. But the patch was accepted, and to my relief everything seems to work well on the current master and on the beta branch for the upcoming release as well.

The second patch restructured the DrmBuffer class. We support KWin builds with or without Generic Buffer Manager (GBM). It therefore made sense to split off the GBM-dependent part of DrmBuffer into a separate file, which gets included only when GBM is available. Martin had this idea and, although the patch is still quite large because of all the moved-around code and renamed classes, the change was straightforward. I still managed to introduce a build-breaking regression, which was quickly discovered and easy to solve. This patch was also meant as a preparation for the future direct scanout of buffers, which will then be done by a new subclass of DrmBuffer, also depending on GBM.

The last patch finally tackled all the issues I experienced when trying to use the previously rather underwhelming AMS code path. Yes, you saw the picture on the screen and the buffer flipping worked, but basic functionality like hot plugging or display suspending was not working at all or led to unpredictable behaviour. One basically complete rewrite later - and many, many manual pluggings and unpluggings of external monitors to test the behaviour - the problems have been solved to the point that I consider the AMS code path ready for daily use. For Plasma 5.11 I therefore plan to make it the new default. That means it will be available on Intel graphics automatically from Linux kernel 4.12 onwards, when on the kernel side the Intel driver also defaults to it. If you want to test my code on Plasma 5.10, you need to set the environment variable KWIN_DRM_AMS, and on kernels older than 4.12 you need to add the boot parameter i915.nuclear_pageflip. If you use Nvidia with the open source Nouveau driver, AMS should be available to you since kernel 4.10. In this case you should only need to set the environment variable above on 5.10 if you want to test it. Since I have only tested AMS with Intel graphics until now, some reports on how it works with Nouveau would be great.

That’s it for now. But of course there is more to come. I haven’t given up on the direct scanout and at some point in the future I want to finish it. I already had a working prototype and mainly waited for my three patches to land. But for now I’ll postpone further work on direct scanout and layered compositing. Instead the last weeks I worked on something special for our Wayland session in 5.11. I call it Night Color, and with this name you can probably guess what it will be. And did I mention, that I was accepted as a Google Summer of Code student for the X.org foundation with my project to implement multi buffered present support in XWayland? Nah, I didn’t. Sorry for rhetorical asking in this smug way, but I’m just very happy and also a bit proud of having learned so much in basically only one year to the point of now being able to start work on an X.org project directly. I’ll write about it in another blog post in the near future.

May 19, 2017
Two weeks have now passed since my introductory blog post, so as promised here is part two! The theme of this blog post is probably something along the lines of “preparation”, as that is what I’ve been doing mostly. The period between that of announcing the accepted student proposals and phase one is called the community bonding period. In this period students are supposed to get to know their mentors, their organizations, familiarize themselves with their projects (even more) and get everything ready to start off the coding period on May 30th.
May 14, 2017

Early last year, Oracle announced it would be shutting down the general project hosting on java.net & kenai.com at the end of April 2017.

We'd been hosting various Solaris content on there since opensolaris.org shut down in 2013, and have worked to move most of it off now. We've requested some redirects be put in place where possible, but they only offer one per subproject, so we can't make all pages redirect to the best spot for that particular content.

If you referenced anything under http://solaris.java.net or its sub projects, here's a guide to where to find that content now:

FOSS Source Code:
Discussion:
The java.net mailing lists were not heavily used, so we've not set up new replacement mailing lists, but suggested discussion migrate to the forums at https://community.oracle.com/community/server_&_storage_systems instead.
Information pages:

And of course, much of it was archived in the Internet Archive for historical reference.

May 09, 2017

I’ve been procrastinating on writing up status, so this update’s a big one.

Boris Brezillon has been hacking away at implementing the VC4 transposer module as a DRM writeback connector, and has things running in test apps. My immediate goal with his work is to enable IGT testing of VC4’s HVS (hardware compositor) features, so that we can implement plane tiling and formats with confidence. However, there’s also a long term goal to extend X11’s modesetting driver to do Present into DRM planes, and only collapse planes down to the primary framebuffer once an atomic check fails or the display goes idle. This is something that other compositors like Raspberry Pi’s DispmanX firmware, Android HWC2, or Weston can all do, so we should fix X11 to do it as well.

Hans Verkuil has started on a CEC implementation for vc4’s HDMI module. The hardware looks fairly simple – unlike many other HDMI CEC implementations, ours is right there in the HDMI module, so there are no complicated inter-kernel-module dependencies to implement. I’ve set him up with register definitions and answered some hardware questions, and I suspect he’ll have results soon.

On my end, I’m excited to announce that the ARM pl111 module DRM KMS driver for Broadcom’s Cygnus is now merged to drm-misc-next, along with V3D support. This has been a more or less pleasant process: I got useful review feedback from Linus Walleij (who’s being involved in the fbdev version of this driver), some cleanups on the V3D code from Florian, and I got the code landed in a reasonable amount of time. The drm-misc process may not be perfect, but it’s the best I’ve experienced in the Linux kernel yet.

Next steps on Cygnus: I need to land a reset controller driver and the associated DT (right now if you lock up the GPU, it never resets). On Raspberry Pi reset works by us turning the power domain off and on, which causes the firmware to go through the reset process, but on Cygnus we don't have a separate power domain for V3D. I also need to finish the panel driver and submit it (I'm using a stub for now, and have a proper panel driver in progress). And I'm working on using gallium's "renderonly" helper functions to allow you to bring up GBM with GL accelerated on vc4 on a pl111 DRM node. This should get X, glmark, and kmscube (and any other GBM applications) working on Cygnus with no userspace changes.

I’ve also been reworking Raspberry Pi’s DSI panel support. I got really discouraged on this after trying to submit to the DRM panel tree back in December. Based on that feedback, I’m moving most of the code to a DRM bridge driver, with just the panel timing in the DRM panel tree. However, writing drivers that talk to DRM panels sucks, so I’ve written the “panel-bridge” helper that cuts about 100 lines of boilerplate from most DRM drivers that talk to a panel, by wrapping the panel in a DRM bridge structure. I submitted an early version for review last week, which was met with some skepticism. I need to re-submit my updated version that converts more drivers – probably when the patch series removes 400 lines of code from DRM, it’ll get a warmer reception.

I’ve also been working on cleaning up our GLES2 conformance test results. I’ve landed 4 patches to the GLSL compiler, and one more Mesa patch is in the queue. The most interesting problem for conformance is that some of the GLES2 conformance tests fail in our register allocator: vc4 can’t spill registers, so we need to be really good at register allocation to make arbitrary programs render successfully.

My current goal for allocation is to implement Sethi-Ullman based instruction scheduling at the QIR stage to try to keep register pressure down. However, register pressure is weighted at a middle level in the QIR instruction scheduler’s heuristics, because otherwise later on we end up too constrained for the QPU instruction scheduler and introduce stalls. To fix the QPU scheduling, I wrote a patch series that gives the driver a chance to choose among available registers during register coloring. With that, the driver can prioritize the accumulators while rotating through the available registers. Initial performance results look excellent, so I should be able to bump register pressure up in priority next and do the Sethi-Ullman work. Interestingly, this callback could resolve a longstanding issue that Intel’s pre-Sandybridge driver had in register allocation as well!

In the X Server, I’ve landed the meson build system. I’m not quite dogfooding it yet, because I can successfully run my desktop with startx but not with gdm (it looks like something is going wrong in the systemd integration). I’ve received patches from a few other X developers fixing up the meson build system for themselves, so I think we’re on track toward general adoption. I’ve made some more progress on implementing Travis CI for the X Server now, and on integrating the current X Server unit tests into meson.

May 05, 2017
Yesterday, after waiting for what felt like a very long time, I finally got the results of my Google Summer of Code (GSoC) application! I can happily say that one of my proposals has been accepted: I will work with Peter Hutterer to redesign and rewrite Piper. The synopsis of my proposal is as follows: Piper is an application frontend to libratbag and ratbagd, a library and system daemon to configure gaming mice respectively.
May 04, 2017
Test run totals:
Passed: 109293/150992 (72.4%)
Failed: 0/150992 (0.0%)
Not supported: 41697/150992 (27.6%)
Warnings: 2/150992 (0.0%)

This is effectively a pass. The Not Supported stuff isn't missing features, as uneducated people are quick to spout; it's mostly stuff the hardware doesn't support or that is pointless to expose on this hardware (lots of image formats).

This is the results from the Vulkan CTS 1.0.2 branch, against mesa master with one patch (a workaround for some InternalErrors that CTS throws up).

Do not call the driver conformant - that is against the Khronos rules, as we haven't paid or filed for approval - but the driver does now effectively pass the latest conformance test suite. I'll update on things if that changes.

Thanks again to everyone involved.
April 26, 2017

Since the hardware very much matters, this is going to be divided into a few parts: the common steps and the hardware-specific ones.

This post is a bit of a living document and will be changed over time, and if you have any questions about it, please reach out through email (robert.foss at collabora.com) or irc (tomeu or robertfoss on #dri-devel on freenode).

Changelog:
2017-04-27: build_android.sh, setup_sdcard.sh: Added -b [device] support to build_android.sh and setup_sdcard.sh
2017-05-02: build_android.sh: Don't write to SD-card without -b option
2017-05-04: Switch git repo urls to shared repository
2017-05-09: Add compiler installation to apt-get
2017-05-09: Re-ordered some instructions
2017 …
April 25, 2017

So we have a new job available for someone interested in joining our team and working on improving the Linux graphics stack. The focus of this job will be on GPU compute related work, but you should also expect to spend time on improving the graphics driver stack in general. We are looking for someone at the Principal Engineer level, but I recommend that you apply even if you don't feel you are quite at that level yet, because, to be fair, people with the kind of experience we are looking for are few and far between, so in the end there is a chance we will hire two more junior developers instead if we have candidates with the right profile.

We are quite flexible on working location for this job, so for the right candidate working remotely is definitely a possibility. And of course if you are interested in joining us at one of our offices, that is an option too; for instance, we have existing team members working out of our Boston (USA), Brno (Czech Republic), Brisbane (Australia) and Munich (Germany) offices.

GPU Compute is rapidly growing in importance and use so this is your chance to be in the middle of it and work for what I personally think is one of the best companies in the world to work for.

So be sure to submit an application through the Red Hat hiring portal.

April 23, 2017

Since the hardware very much matters, this is going to be divided into a few parts: the common steps and the hardware-specific ones.

Common steps

mkdir /opt/android
cd /opt/android
repo init -u https://android.googlesource.com/platform/manifest -b android-7.1.1_r28
cd /opt/android/.repo
git clone git://git.collabora.com/git/user/robertfoss/android_manifest.git local_manifests -b etnaviv-android
repo sync -j75

mkdir /opt/imx6_android
cd /opt/imx6_android
git clone git://git.collabora.com/git/user/robertfoss/linux.git -b imx_rdu2_v4.11-rc3

# The mkimage tool is used even if you're not
# using u-boot as a bootloader
sudo apt install u-boot-tools

# Fetch Kconfig, bootloaders and some scripts
git clone git://git.collabora.com/git/user/robertfoss/rdu2.git .

# This will destroy all data …
Of the Wayland demo clients in the Weston repository, simple-shm is the simplest. All the related code is in that one file, and it interfaces directly with libwayland. It does not use GL or EGL, so it can be run on systems where the EGL stack does not support the Wayland platform or extensions. However, what it renders is surprising:
The original simple-shm client on a Weston desktop.

The square with apparently garbage texture is the original simple-shm. To any graphics developer who does not know any better, that immediately looks like something is wrong with the image stride somewhere in the graphics stack. That really is what it was supposed to look like, not a bug.

I decided to propose a different rendering, that would not look so much like a bug, and had some real diagnostic value.
The proposed appearance of simple-shm, the way it is supposed to look.
The new appearance has some vertical bars moving from left to right, some horizontal bars moving upwards, and some circles that shrink into the center. With these, you can actually see if there is a stride bug somewhere, or non-uniform scaling. There is one more diagnostic feature.
This is how the proposed simple-shm looks when the X-channel is mistaken for alpha.
Simple-shm uses XRGB buffers. If the compositor does not properly ignore the X-channel, and uses it as alpha, you will see a cross over the image. Depending on whether the compositor repaints what is below simple-shm or not, the cross will either saturate to white or show the background through. It is best to have a bright background picture to clearly see it.

I do hope no-one gets hypnotized by the animation. ;-)
Recently I drew some diagrams of how an EGL library relates to the Wayland stack. Here I am presenting the Mesa EGL version of them with the details explained.
Mesa EGL with Wayland, and simplified X as comparison.


X11 part

The X11 part of the diagram is very much simplified. It completely ignores indirect rendering, DRI1, details of DRI2, and others. It only shows that a direct rendering X11 EGL application uses the X11 protocol to create an X11 window, and the Mesa EGL X11 platform uses the DRI2 protocol in some way to communicate with the X server. Naturally the application also uses one of the OpenGL interfaces. The X server has hardware or platform specific drivers that are generally referred to as DDX. On the Linux DRI stack, these call into libdrm and the various driver specific sub-libraries. In the end they use the kernel DRM services, like kernel mode setting (KMS). All this in the diagram is just for comparison with a Wayland stack.

Wayland server

The Wayland server in the diagram is Weston with the DRM backend. The server does its rendering using GL ES 2, which it initialises by calling EGL. Since the server runs on "bare KMS", it uses the EGL DRM platform, which could really be called the GBM platform, since it relies on the Mesa GBM interface. Mesa GBM is an abstraction of the graphics driver specific buffer management APIs (for instance the various libdrm_* libraries), implemented internally by calling into the Mesa GPU drivers.

Mesa GBM provides graphics memory buffers to Weston. Weston then uses EGL calls to bind them into GL objects, and renders into them with GL ES 2. A rendered buffer is shown on an output (monitor) by queuing a page flip via the libdrm KMS API.

If the EGL implementation offers the extension EGL_WL_bind_wayland_display, Weston will use it to register its wl_display object (facing the clients) to EGL. In practice, the Mesa EGL then adds a new global Wayland object to the wl_display. That object (or interface) is called wl_drm, and the server will automatically advertise that to all clients. Clients will use wl_drm for DRM authentication, getting the right DRM device node, and sharing graphics buffers with the server without copying pixels.

Wayland client

A Wayland client, naturally, connects to a Wayland server, and gets the main Wayland protocol object wl_display. The client creates a window, which is a Wayland object of type wl_surface. All that follows is enabled by the Wayland platform support in Mesa EGL.

The client passes the wl_display object to eglGetDisplay() and receives an EGLDisplay to be used with EGL calls. Then comes the trick that is denoted by the double-arrowed blue line from Wayland client to Mesa EGL in the diagram. The client calls the wayland-egl API (implemented in Mesa) function wl_egl_window_create() to get the native window handle. Normally you would just use the "real" native window object wl_surface (or an X11 Window if you were using X). The native window handle is used to create the EGLSurface EGL handle. Wayland has this extra step and the wayland-egl API because a wl_surface carries no information about its size. When the EGL library allocates buffers, it needs to know the size, and the wayland-egl API is the only way to tell it.

Once EGL Wayland platform knows the size, it can allocate a graphics buffer by calling the Mesa GPU driver. Then this graphics buffer needs to be mapped into a Wayland protocol object wl_buffer. A wl_buffer object is created by sending a request through the wl_drm interface carrying the name of the (DRM) graphics buffer. In the server side, wl_drm requests are handled in the Mesa EGL library, where the corresponding server side part of the wl_buffer object is created. In the diagram this is shown as the blue dotted arrow from EGL Wayland platform to itself. Now, whenever the wl_buffer object is referenced in the Wayland protocol, the server knows exactly what it is.

The client now has an EGLSurface ready, and renders into it by using one of the GL APIs or OpenVG offered by Mesa. Finally, the client calls eglSwapBuffers() to show the result in its Wayland window.

The buffer swap in Mesa EGL Wayland platform uses the Wayland core protocol and sends an attach request to the wl_surface, with the wl_buffer as an argument. This is the blue dotted arrow from EGL Wayland platform to Wayland server.

Weston itself processes the attach request. It knows the buffer is not a shm buffer, so it passes the wl_buffer object to the Mesa EGL library in an eglCreateImageKHR() function call. In return Weston gets an EGLImage handle, which is then turned into a 2D texture, and used in drawing the surface (window). This operation is enabled by EGL_WL_bind_wayland_display extension.

Summary

The important facts, that should be apparent in the diagram, are:
  • There are two different EGL platforms in play: one for the server, and one for the clients.
  • A Wayland server does not contain any graphics hardware or driver specific code, it is all in the generic libraries of DRM, EGL and GL (libdrm and Mesa).
  • Everything about wl_drm is an implementation detail internal to the EGL library in use.
The system-dependent part of Weston is the backend, which somehow must be able to drive the outputs. The new abstractions in Mesa (the GBM API) make it completely hardware-agnostic on standard Linux systems. Therefore a Wayland server implementation does not need its own set of graphics drivers, unlike X.

It is also worth noting that 3D graphics on X uses very much the same drivers as Wayland. However, due to the Wayland requirements on the EGL framework (extensions, the EGL Wayland platform), proprietary driver stacks need to implement Wayland support specifically, or they need to be wrapped into a meta-EGL library that glues Wayland support on top. Proprietary drivers also need to provide a way to use accelerated graphics without X, for a Wayland server to run without X beneath. Therefore desktop proprietary drivers like Nvidia's have a long way to go: currently Nvidia does not implement EGL at all, has no support for Wayland, and has no support for running without X, or even setting a video mode without X.

Due to the way wl_drm is totally encapsulated into Mesa EGL and how the interfaces are defined for the EGL Wayland platform and the EGL extension, another EGL implementor can choose their very own way of sharing graphics buffers instead of using wl_drm.

There are already plans to change some of the architecture described in this article, so it is possible that details in the diagram become outdated fairly soon. This article also does not consider a purely software rendered Wayland stack, which would certainly lift all these requirements, but would quite likely be too slow in practice for the desktop.

See also: the authoritative description of the Wayland architecture
While being contracted by Collabora, I started a Wayland R&D project in October 2011 with the primary goal of getting to know Wayland, and strengthening Wayland expertise in Collabora. During the four months I started the wl_shell_surface protocol for desktops, added screen locking, ported an X screensaver to Wayland with new protocol, and most recently implemented surface transformations in Weston (the reference compositor, originally the wayland-demos compositor). All this sponsored by Collabora.

The project started by getting wayland-demos running under X, and then looking into the bugs I hit. To rule out problems in the hardware GL renderer, I also got the demos running with softpipe and llvmpipe. Trying to fix segmentation faults and other obvious problems was my stepping stone into the Wayland code base.

My first real piece of work was screen locking. That included adding special protocol for it, having a way to have privileged Wayland clients, implementing locking in the shell plugin in the compositor, and writing an unlock dialog for the desktop-shell client. Those are the obvious parts. I also had to extend the shell plugin interface, find a way to hide surfaces so they do not render while the screen is locked, and of course bug hunting and patch set rebasing and rewriting, before screen locking landed upstream.

Next was porting an X screensaver as a regular Wayland client. Once that worked, I extended the protocol by adding a screensaver interface, and made the shell plugin automatically start the screensaver application. Handling screensavers would have been a walk in the park, except I needed shell-specific data to be attached to all surfaces. I wrote a hacky solution, but in the end, Kristian Høgsberg wanted me to add a whole new interface into the shell protocol for this. It became the wl_shell_surface interface, and all demo clients needed to adopt it. Yet that was not all. Since we are used to having per-monitor screensavers, I needed my screensaver to run a separate instance for each monitor. Hence I had to add output event callbacks in the toytoolkit.

A cleanup phase came next: I took Valgrind and ran. I fixed a pile of memory leaks and wrote missing destructor functions all over, in the compositor, clients and the toytoolkit, at the same time collecting a Valgrind suppressions list to ease Valgrinding in the future. This work included adding some ad hoc way of cleanly exiting the demo clients.

In January there were some discussions on maximised and full-screen surfaces, what they are and how they should be implemented. Surface scaling was raised as one point. Weston already had the zoom effect, and full-screen scaling would be another surface transformation, so I decided to write a transformation matrix stack for supporting any number of simultaneous transformations. It turned out to be a three week task.

Implementing surface transformations required changes all over Weston. First, I needed a way to invert the transformation which is a 4-by-4 matrix. After searching in vain for a MIT-licenced C implementation I wrote one myself, based on LU-decomposition. I believe LU-decomposition is more efficient on a 4x4 matrix than the cofactor method. Along the inversion routines, I wrote a unit test application for testing the speed and precision of the inversion. Detecting and dealing with non-invertible transformations is also important.

Going through the transformation stack every time you need to transform a point might be costly, so I added a cached total transform and its inverse. Implementing input redirection was a simple matter of applying the inverse total transform to pointer coordinates. Needing a way to test transformations, I added a Weston key binding for rotating surfaces, and modified an existing demo application to mark the clicked point. Adding functions for explicitly converting between display global coordinates and surface local coordinates (surface local are the only ones a client knows of) clarified some of the coordinate computations.
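As a generic illustration of the input redirection step (this is not Weston's actual code), mapping a global pointer position into surface-local coordinates is just a matter of multiplying by the cached inverse of the 4-by-4 total transform:

/* Generic sketch: apply a column-major 4x4 transformation matrix to a
 * 2D point, as when converting global pointer coordinates into
 * surface-local ones with the inverse total transform. */
static void
transform_point(const float m[16], float x, float y,
                float *out_x, float *out_y)
{
    float v[4] = { x, y, 0.0f, 1.0f };
    float r[4];
    int row, col;

    for (row = 0; row < 4; row++) {
        r[row] = 0.0f;
        for (col = 0; col < 4; col++)
            r[row] += m[col * 4 + row] * v[col];
    }

    /* Perspective divide; for affine transforms r[3] is 1. */
    *out_x = r[0] / r[3];
    *out_y = r[1] / r[3];
}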

Surface painting and damage region tracking needed fixes, too. Previously, a zoomed surface was repainted as a whole, and it forced a full display redraw, i.e. damaging the whole display. Transformed surface repaint needed to start honoring the repaint regions, so we could avoid excessive repainting. Damage and repaint regions are tracked as global coordinate axis aligned rectangles. Whenever a transformed surface is damaged (requires repainting), we need to compute the bounding box for the damage instead of simply using the global x, y of the top-left corner and the surface width, height. Then during surface painting, we take the list of damage rectangles, and render only those. Surface local coordinates (texture coordinates) are computed via the inverse transformation. This method may result in sampling outside of a surface's buffer (texture), so those samples need to be discarded in the fragment shader.

Other things that needed fixing after the surface transformations were window move and resize. Before fixing, moving a surface would not follow the pointer but move in the surface local orientation. Resize needed the same orientation fix, and another fix in relative surface motion that a client can set in the surface's attach request.

What you mostly see as the result of the surface transformations work is that you can rotate any normal window, no application support needed. The pointer position on screen, over a window, accurately corresponds to what the application receives as the local pointer location. I did not realise it at the time, but this input redirection working flawlessly became an appreciated feature. Apparently it is hard or impossible to do in X, I would not know. In Wayland, and for me, it was just another relatively easy bug to be fixed. The window rotation feature was meant purely for debugging surface transformations.

Two rotated windows and some flowers.
There are still further issues to be fixed with surface transformations. Relative surfaces, like pop-up windows and menus, are not transformed and appear at a wrong location. Pointer cursors are not transformed; you would want the text bar cursor to be aligned with the text orientation. Continuously resizing a transformed window from its (locally) top-left corner makes the window drift away. We are probably still damaging larger regions than absolutely necessary for repaints. Repaint optimisation of opaque surfaces does not work with transformations.

During all this work of four months there were also the usual bug hunts, enhancements and fixes all over. For example, decorationless EGL apps, which turned out to have been a bug in Cairo, and moving the configuration file parser into a helper library that is shared between clients and the compositor.

Now, I am done with the Wayland R&D project and moving into another project at Collabora. In the new project I will continue working on Wayland, Weston, and the demos.
I recently got a Nokia N9 phone. One of the first things I did was copy my music collection into it. Since the player shows also album cover images, if such are stored, I started adding them -- not by embedding them into ID3v2 tags but as separate files, to avoid useless copies of images.

Usually it is as simple as putting a cover.jpg file into a directory that contains a single album. Sometimes, though, that does not work. I found out that the N9's default music player is supposed to follow the Media Art Storage specification. That gave me hints.

If a directory contains more than one album, you can name the cover image files according to the album, for example 'Back in Black.jpg' and 'Flick of the Switch.jpg', as long as the names correspond to the ID3 tag album name (somehow?).

My real problem case was a directory full of songs downloaded from Nectarine. I edited them all (EasyTAG is a wonderful tool) to make the ID3 album tag "Nectarine" because I wanted to have them all under the same "album", and there are over 50 songs in that single directory. Simply adding a cover.jpg or Nectarine.jpg did not work.

There are two possible reasons that I found. First, the directory contains too many files, according to the Media Art Storage spec. Second, apparently the cover art is not taken into use, unless at least one song file, which would use that cover art, is touched (modification date updated).

I created a new directory, moved one Nectarine song into it, and put Nectarine.jpg there, too. And it started to work, for all my Nectarine songs.

There is software called Tracker in the N9, which maintains some sort of database of all media. Also album cover art gets used via Tracker. If you ssh into your phone, and move around your media files, Tracker update is not automatically triggered. You could use the command tracker-control -r to force a full rebuild when you launch e.g. the music player the next time, but the rebuild will take a long time. An easy way to force a faster rebuild is to plug the N9 into a computer via USB, and then unplug it.
Now that screen locking is done in Wayland demos, it is time to go for the eye-candy: full-screen idle animations, also known as screensavers. The first step was to port an existing screensaver to Wayland. I chose glmatrix from XScreenSaver, because it is cool, and it renders with OpenGL. This way I did not have to port Xlib based rendering to Cairo (yay!).

Here is GLMatrix running as a regular, windowed application on Wayland, using the toytoolkit:
GLMatrix on the Wayland demo compositor.
On Wayland, screensavers can be reduced to pure animation applications, while the compositor handles everything about locking. Next, we need a Wayland protocol extension to actually use this idle-animation in a screensaver'y way.

GLMatrix is already in the Wayland demo repository as a client called wscreensaver, and it requires cairo-gl, just like gears does.
This is a short and vague glimpse of the interfaces that the Linux kernel offers to user space for display and graphics management, from the history to what is hot and new, to what might perhaps be coming after. The topic became current for me when I started preparing Weston for global thermonuclear war.

The pre-history


In the age of dragons, kernel mode setting did not exist. There was only user space mode setting, where the job of the kernel driver (if any) was simply to give user space direct access to the graphics card registers. A user space driver (well, Xorg video DDX, really, err... or what it was at the time of XFree86) would then poke the card registers to set a mode. The kernel had no idea of anything.

The kernel DRM infrastructure was started as an out-of-tree kernel module for coordinating between multiple programs wanting to access the graphics card's resources. Later it was (partially?) merged into the kernel tree (the year is a lie, 2.3.18 came out in 1999), and much, much later it was finally deleted from the libdrm repository.

The middle age


For some time, the kernel DRM existed alongside user space mode setting. It was a dark time full of crazy hacks to keep it all together with duct tape, barbwire and luck. GPUs and hardware accelerated OpenGL started to come up.

The new age


With the advent of kernel mode setting (KMS), the DRM kernel drivers took charge of the graphics card resources: outputs, video modes, memory allocations, hotplug! User space mode setting became obsolete and was eventually killed. The kernel driver was finally actually in control of the graphics hardware.

KMS probably started with just setting the main framebuffer (primary plane) for each "CRTC" and programming the video mode. CRTC stands for "cathode-ray tube controller", but essentially means a block that reads memory (a framebuffer) and produces a bitstream according to video mode timings. The bitstream is directed into an "encoder", which turns it into a proper physical/analogue signal, like VGA or digital DVI. The signal then exits the graphics card through a "connector". CRTC, encoder, and connector are the basic concepts in the KMS API. Quite often these can be combined in some restricted ways, like a single CRTC feeding two encoders for clone mode.

Even ancient hardware supported hardware cursors: a small sprite that was composited into the outgoing video signal on the fly, which meant that it was very cheap to move around. The cursor, being so special and often having a funny color format (alpha!), got its very own DRM ioctl.

There were also hardware overlays (additional or secondary planes) on some hardware. While the primary framebuffer covers the whole display, an overlay is another buffer (just like the cursor) that gets mixed into the bitstream at the CRTC level. It is like basic compositing done on the scanout hardware level. Overlays usually had additional benefits, for example they could apply scaling or color space conversion (hello, video players) very efficiently. Overlays being different, they too got their very own DRM ioctls.

The KMS user space ABI was anything but atomic. With the X11 tradition, it wasn't too important how to update the displays, as long as the end result eventually was what you wanted. Race conditions in content updates didn't matter too much either, as X was racy as hell anyway. You update the CRTC. Then you update each overlay. You might update the cursor, too. By luck, all these updates could hit the same vblank. Or not. Or you don't hit vblank at all, and get tearing. No big deal, as X was essentially all about front-buffer rendering anyway. (And then there were huge efforts in trying to fix it all up with X, GLX, Mesa and GL-compositors, and avoid tearing, and it ended up complicated.)

With the advent of X compositing managers, which did not play well with the awkward X11 protocol (Xv) or the hardware overlays, and with the rise of GPU power and OpenGL, it was thought that hardware overlays would eventually die out. It turned out that the benefits of hardware overlays were too great to abandon, and with Wayland we again have a decent chance to make the most of them while still enjoying compositing.

The global thermonuclear war (named after a git branch by Rob Clark)


The quality of display updates became important. People do not like tearing. Someone actually wanted to update the primary framebuffer and the overlays on the same vblank, guaranteed. And the cursor as the cherry on top.

We needed one ABI to rule them all.

Universal planes brings framebuffers (primary planes), overlays (secondary planes) and cursors (cursor planes) together under the same API. No more type specific ioctls, but common ioctls shared by them all. As these objects are still somewhat different, overlays having wildly differing features and vendors wanting to expose their own stuff, object properties were invented.

An object property is essentially a {key, value} pair. In the API, the name of a key is a string. Each object has its own set of keys. To use a key, you must know it by name, fetch the handle, and then use the handle when setting the value. Handles seem to be per-object, so make sure to fetch them separately for each.
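For illustration, this is roughly how a property handle is looked up by name with libdrm (a sketch, error handling omitted):

#include <string.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Sketch: find the property handle (ID) for a named property on a KMS
 * object, for example a plane or a CRTC. */
static uint32_t
find_property_id(int fd, uint32_t object_id, uint32_t object_type,
                 const char *name)
{
    drmModeObjectProperties *props;
    uint32_t id = 0;
    uint32_t i;

    props = drmModeObjectGetProperties(fd, object_id, object_type);
    for (i = 0; i < props->count_props; i++) {
        drmModePropertyRes *prop = drmModeGetProperty(fd, props->props[i]);

        if (strcmp(prop->name, name) == 0)
            id = prop->prop_id;
        drmModeFreeProperty(prop);
    }
    drmModeFreeObjectProperties(props);

    return id;
}

For example, find_property_id(fd, plane_id, DRM_MODE_OBJECT_PLANE, "FB_ID") would return the handle for a plane's framebuffer property, provided the driver exposes a property with that name.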

Atomic mode setting and nuclear pageflip are two sides of the same feature. Atomicity is achieved by gathering a set of property changes, and then pushing them all into the kernel in a single ioctl call. Then that call either succeeds or fails as a whole. Libdrm offers a drmModePropertySet for gathering the changes. Everything is exposed as properties: the attached FB, overlay position, video mode, etc.

Atomic mode setting means setting the output modes of a single graphics device, more or less. Devices may have hard-to-express limitations. A simple example is the available scanout memory bandwidth: you can drive either two mid-resolution outputs, or one high-resolution output. Or maybe some CRTC-encoder-connector combination is not possible together with a particular combination for another output. Collecting the video mode, encoder and connector setup over the whole graphics card into a single operation avoids flicker. Either the whole set succeeds, or it fails. Without atomic mode setting, changing multiple outputs would not only take longer, but if some step failed, you'd have to undo all earlier steps (and hope the undo steps don't fail). Plus, there would be no way to easily test whether a certain combination is possible. Atomic mode setting fixes all this.

Nuclear pageflip is about synchronizing the update of a single output (monitor) and making that atomic. This means that when user space wants to update the primary framebuffer, move the cursor, and update a couple of overlays, all those changes happen at the same vblank. Again it all either succeeds or fails. "Every frame is perfect."
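To sketch how such an update is built from property changes: drmModePropertySet was the proposed libdrm API at the time of writing, and the request-based API that eventually landed looks roughly like this (the property handles are assumed to have been looked up by name as in the earlier sketch; error handling omitted):

#include <xf86drm.h>
#include <xf86drmMode.h>

/* Sketch of a nuclear pageflip using the atomic request API that later
 * landed in libdrm. */
static int
nuclear_flip(int fd, uint32_t plane_id, uint32_t fb_id,
             uint32_t fb_id_prop, uint32_t crtc_x_prop, uint32_t crtc_y_prop)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    /* Gather all the property changes for this output... */
    drmModeAtomicAddProperty(req, plane_id, fb_id_prop, fb_id);
    drmModeAtomicAddProperty(req, plane_id, crtc_x_prop, 0);
    drmModeAtomicAddProperty(req, plane_id, crtc_y_prop, 0);

    /* ...and push them in one ioctl: they all hit the same vblank,
     * or the whole commit fails. */
    ret = drmModeAtomicCommit(fd, req, DRM_MODE_PAGE_FLIP_EVENT, NULL);
    drmModeAtomicFree(req);

    return ret;
}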

And then there shall be ponies (at the end of the rainbow)


Once the global thermonuclear war is over, we have the perfect ABI for driving display updates.

Well, almost. Enter NVidia G-Sync, or AMD's FreeSync which is actually backed by a VESA standard. Dynamically variable refresh rate. We have no way yet for timing display updates in DRM. All we can do is kick out a display update, and it will hopefully land on the next vblank, whenever that is. But we can't tell the DRM when we would like it to be. Everything so far assumes that the display refresh rate is a constant, apart from an explicit mode switch. Though I have heard that e.g. Chrome for Intel (i915, LVDS/eDP reclocking) has some hacks that opportunistically drop the refresh rate to save power.

There is also a caveat in the DRM of today (Jun 3rd, 2014). You can schedule a pageflip, but if there is rendering still pending on that framebuffer on the same GPU that is presenting it, the pageflip will not happen until the rendering completes. And you do not know when it will complete, which means you do not know whether you will hit the very next vblank or something later.

If the rendering GPU is not the same graphics device that presents the framebuffer, you do not get synchronization at all. That means that you may be scanning out an incomplete rendering for a frame or two, or you have to stall the GPU to make sure it is done before scheduling the page flip. This should be fixed with the fences related to dma-bufs (Hi, Maarten Lankhorst).

And so the unicorn keeps on running.
Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This post, which will hopefully become a series rather than a single entry, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry through the wl_display. Registry is the root of the whole interface (class) hierarchy. Wl_registry advertises the global objects by numerical name, and using wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.
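For illustration, this is roughly what binding to a global looks like with libwayland-client (a minimal sketch, error handling omitted):

#include <string.h>
#include <wayland-client.h>

static struct wl_compositor *compositor;

static void
handle_global(void *data, struct wl_registry *registry, uint32_t name,
              const char *interface, uint32_t version)
{
    /* 'name' is the numerical name of the advertised global. */
    if (strcmp(interface, "wl_compositor") == 0)
        compositor = wl_registry_bind(registry, name,
                                      &wl_compositor_interface, 1);
}

static void
handle_global_remove(void *data, struct wl_registry *registry, uint32_t name)
{
}

static const struct wl_registry_listener registry_listener = {
    handle_global,
    handle_global_remove
};

int
main(void)
{
    struct wl_display *display = wl_display_connect(NULL);
    struct wl_registry *registry = wl_display_get_registry(display);

    wl_registry_add_listener(registry, &registry_listener, NULL);
    /* The roundtrip guarantees the initial globals have been received. */
    wl_display_roundtrip(display);

    /* 'compositor' can now be used to create wl_surface objects. */
    return 0;
}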

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.

Although rare, the server may also create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.

Object creation never fails. Once the request or event is sent, the new object it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.

How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.
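In the XML protocol specification, a destructor request is marked with a type attribute, which is what makes the generated wl_foobar_destroy() also destroy the client-side proxy. A minimal sketch for the hypothetical wl_foobar interface:

<interface name="wl_foobar" version="1">
  <request name="destroy" type="destructor">
    <description summary="destroy the wl_foobar object">
      Destroy this object and let the server release its resources.
    </description>
  </request>
</interface>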

The other way is to destroy the object by an event. In that case, the interface must not define a destructor request in its protocol specification, and the event must be clearly documented as destructive, since there is no automation nor safety net for this. This is for cases where the server decides when an object dies, and it requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.

Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence. The client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into synchronous, congratulations!

Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:
  • Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such that all resources on both the server and client side could be freed. And there are still some cases not fixed.
  • Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
  • Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
  • Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.
Now that Presentation feedback has finally landed in Weston (feedback, flags), people are starting to pay attention to the output timings as now you can better measure them. I have seen a couple of complaints already that Weston has an extra frame of latency, and this is true. I also have a patch series to fix it that I am going to propose.

To explain how the patch series affects Weston's repaint loop, I made some JSON-timeline recordings before and after, and produced some graphs with Wesgr. Here I will explain how the repaint loop works timing-wise.

Original post Feb 11, 2015.
Update Mar 20, 2015: the patches have landed in Weston.


The old algorithm

The old repaint scheduling algorithm in Weston repaints immediately on receiving the pageflip completion event. This maximizes the time available for the compositor itself to repaint, but it also means that clients can never hit the very next vblank / pageflip.

Figure 1. The old algorithm, the client paints as response to frame callbacks.

Frame callback events are sent at the "post repaint" step. This gives clients almost a full frame's time to draw and send their content before the compositor goes to "begin repaint" again. In Figure 1 you see that if a client paints extremely fast, the latency to screen is almost two frame periods. The frame latency can never be less than one frame period, because the compositor samples the surface contents (the "repaint flush" point) immediately after the previous vblank.

Figure 2. The old algorithm, the client paints as response to Presentation feedback events.

While frame callback driven clients still get to the full frame rate, the situation is worse if the client painting is driven by presentation_feedback.presented events. The intent is to draw and show a new frame as soon as the old frame was shown. Because Weston starts repaint immediately on the pageflip completion, which is essentially the same time when Presentation feedback is sent, the client cannot hit the repaint of this frame and gets postponed to the next. This is the same two frame latency as with frame callbacks, but here the framerate is halved because the client waits for the frame to be actually shown before continuing, as is evident in Figure 2.

Figure 3. The old algorithm, client posts a frame while the compositor is idle.

Figure 3. shows a less relevant case, where the compositor is idle while a client posts a new frame ("damage commit"). When the compositor is idle graphics-wise (the gray background in the figure), it is not repainting continuously according to the output scanout cycle. To start painting again, Weston waits for an extra vblank first, then repaints, and then the new frame is shown on the next vblank. This is also a 1-2 frame period latency, but it is unrelated to the other two cases, and is not changed by the patches.

The modification to the algorithm

The modification is simple, yet perhaps counter-intuitive at first. We reduce the latency by adding a delay. The "delay before repaint" is in all the figures, and the old algorithm is essentially using a zero delay. The compositor's repaint is delayed so that clients have a chance to post a new frame before the compositor samples the surface contents.

How much delay is good is a hard question. Too small a delay, and clients do not have time to act. Too long a delay, and the compositor itself is in danger of missing the vblank deadline. I do not know what a good amount is or how to derive it, so I just made it configurable. You can set the repaint window length in milliseconds in weston.ini. The repaint window is the time from starting repaint to the deadline, so the delay is the frame period minus the repaint window. If the repaint window is too long for a frame period, the algorithm reduces to the old behaviour.
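As a sketch, assuming the option name used by the patch series, the setting goes into the core section of weston.ini:

[core]
# Time in milliseconds from the start of the compositor repaint to the
# vblank deadline; the delay before repaint is the frame period minus this.
repaint-window=7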

The new algorithm

The following figures are made with a 60 Hz refresh and a 7 millisecond repaint window.

Figure 4. The new algorithm, the client paints as response to frame callback.

When a client paints as response to the frame callback (Figure 4), it still has a whole frame period of time to paint and post the frame. The total latency to screen is a little shorter now, by the length of the delay before compositor's repaint. It is a slight improvement.

Figure 5. The new algorithm, the client paints as response to Presentation feedback.

A significant improvement can be seen in Figure 5. A client that uses the Presentation extension to wait for a frame to be actually shown before painting again is now able to reach the full output frame rate. It just needs to paint and post a new frame during the delay before compositor's repaint. This mode of operation provides the shortest possible latency to screen as the client is able to target the very next vblank. The latency is below one frame period if the deadlines are met.

Discussion

This is a relatively simple change that should reduce display latency, but analyzing how exactly it affects things is not trivial. That is why Wesgr was born.

This change does not really allow clients to wait some additional time before painting to reduce the latency even more, because nothing tells clients when the compositor will repaint exactly. The risk of missing an unknown deadline grows the later a client paints. Would knowing the deadline have practical applications? I'm not sure.

These figures also show the difference between the frame callback and Presentation feedback. When a client's repaint loop is driven by frame callbacks, it maximizes the time available for repainting, which reduces the possibility to miss the deadline. If a client drives its repaint loop by Presentation feedback events, it minimizes the display latency at the cost of increased risk of missing the deadline.

All the above ignores a few things. First, we assume that the time of display is the point of vblank which starts to scan out the new frame. Scanning out a frame actually takes most of the frame period, it's not instantaneous. Going deeper, updating the framebuffer during the scanout period instead of vblank could allow reducing latency even more, but the matter becomes complicated and even somewhat subjective. I hear some people prefer tearing to reduce the latency further. Second, we also ignore any fencing issues that might come up in practice. If a client submits a GPU job that takes a long while, there is a good chance it will cause everything to miss a deadline or more.

As usual, this work and most of the development of JSON-timeline and Wesgr were sponsored by Collabora.

PS. Latency and timing issues are nothing new. Owen Taylor has several excellent posts on related subjects in his blog.
How is an uncompressed raster image laid out in computer memory? How is a pixel represented? What are stride and pitch and what do you need them for? How do you address a pixel in memory? How do you describe an image in memory?

I tried to find a web page for dummies explaining all that, and all I could find was this. So, I decided to write it down myself with the things I see as essential.


An image and a pixel

Wikipedia explains the concept of raster graphics, so let us take that idea as a given. An image, or more precisely, an uncompressed raster image, consists of a rectangular grid of pixels. An image has a width and height measured in pixels, and the total number of pixels in an image is obviously width×height.

A pixel can be addressed with coordinates x,y after you have decided where the origin is and which way the coordinate axes go.

A pixel has a property called color, and it may or may not have opacity (or occupancy). Color is usually described as three numerical values, let us call them "red", "green", and "blue", or R, G, and B. If opacity (or occupancy) exists, it is usually called "alpha" or A. What R, G, B, and A actually mean is irrelevant when looking at how they are stored in memory. The relevant thing is that each of them is encoded with a certain number of bits. Each of R, G, B, and A is called a channel.

When describing how much memory a pixel takes, one can use units of bits or bytes per pixel. Both can be abbreviated as "bpp", so be careful which one it is and favour more explicit names in code. Bits per channel is also used sometimes, and each channel can have a different number of bits per pixel. For example, the rgb565 format is 16 bits per pixel, 2 bytes per pixel, 5 bits for each of the R and B channels, and 6 bits for the G channel.
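As a sketch, unpacking such a pixel is just shifting and masking; here assuming R occupies the highest 5 bits of the 16-bit unit and B the lowest 5:

#include <stdint.h>

/* Sketch: unpack an rgb565 pixel into its channels. */
static void
unpack_rgb565(uint16_t v, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (v >> 11) & 0x1f;  /* 5 bits of red */
    *g = (v >> 5) & 0x3f;   /* 6 bits of green */
    *b = v & 0x1f;          /* 5 bits of blue */
}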

A pixel in memory

Pixels do not come in arbitrary sizes. A pixel is usually 32 or 16 bits, or 8 or even 1 bit. 32 and 16 bit quantities are easy and efficient to process on 32 and 64 bit CPUs. Your usual RGB-image with 8 bits per channel is most likely in memory with 32 bit pixels, the extra 8 bits per pixel are simply unused (often marked with X in pixel format names). True 24 bits per pixel formats are rarely used in memory because trading some memory for simpler and more efficient code or circuitry is almost always a net win in image processing. The term "depth" is often used to describe how many significant bits a pixel uses, to distinguish from how many bits or bytes it occupies in memory. The usual RGB-image therefore has 32 bits per pixel and a depth of 24 bits.

How channels are packed in a pixel is specified by the pixel format. There are dozens of pixel formats. When decoding a pixel format, you first have to understand if it is referring to an array of bytes (particularly used when each channel is 8 bits) or bits in a unit. A 32 bits per pixel format has a unit of 32 bits, that is uint32_t in C parlance, for instance.

The difference between an array of bytes and bits in a unit is the CPU architecture endianness. If you have two pixel formats, one written in array-of-bytes form and one written in bits-in-a-unit form, and they are equivalent on a big-endian architecture, then they will not be equivalent on a little-endian architecture. And vice versa. This is important to remember when you are mapping one set of pixel formats to another, between OpenGL and anything else, for instance. Figure 1 shows three different pixel format definitions that produce identical binary data in memory.

Figure 1. Three equivalent pixel formats with 8 bits for each channel. The writing convention here is to list channels from highest to lowest bits in a unit. That is, abgr8888 has r in bits 0-7, g in bits 8-15, etc.

It is also possible, though extremely rare, that architecture endianness also affects the order of bits in a byte. Pixman, undoubtedly inheriting it from the X11 pixel format definitions, is the only place where I have seen that.

An image in memory

The usual way to store an image in memory is to store its pixels one by one, row by row. The origin of the coordinates is chosen to be the top-left corner, so that the leftmost pixel of the topmost row has coordinates 0,0. First there are all the pixels of the first row, then the second row, and so on, including the last row. A two-dimensional image has been laid out as a one-dimensional array of pixels in memory. This is shown in Figure 2.

Image layout in memory.
Figure 2. The usual layout of pixels of an image in memory.
An image in memory holds not only the width×height pixels; each row may also have some padding. The padding area is not used for storing anything, it only aligns the length of the row. Having padding requires a new concept: image stride.

Padding is often necessary due to hardware reasons. The more specialized and efficient the hardware for pixel manipulation is, the more likely it is to have specific requirements on row start and length alignment. For example, Pixman, and therefore also Cairo (the image backend particularly), requires that rows are aligned to 4 byte boundaries. This makes it easier to write efficient image manipulations using vectorized or other instructions that may even process multiple pixels at the same time.
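For example, a stride honouring a 4-byte row alignment can be computed by rounding the row length in bytes up to the next multiple of 4:

/* Round the row length in bytes up to the next multiple of 4. */
stride_bytes = (width * bytes_per_pixel + 3) & ~3;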

Stride or pitch

Image width is practically always measured in pixels. Stride on the other hand is related to memory addresses and therefore it is often given in bytes. Pitch is another name for the same concept as stride, but can be in different units.

You may have heard rules of thumb that stride is in bytes and pitch is in pixels, or vice versa. Stride and pitch are used interchangeably, so be sure of the conventions used in the code base you might be working on. Do not trust your instinct on bytes vs. pixels here.

Addressing a pixel

How do you compute the memory address of a given pixel x,y? The canonical formula is:
pixel_address = data_begin + y * stride_bytes + x * bytes_per_pixel.
The formula starts with the address of the first pixel in memory, data_begin, then skips to row y while each row is stride_bytes long, and finally skips to pixel x on that row.

In C code, if we have 32 bit pixels, we can write
uint32_t *p = data_begin;
p += y * stride_bytes / sizeof(uint32_t);
p += x;
Notice, how the type of p affects the computations, counting in units of uint32_t instead of bytes.

Let us assume the pixel format in this example is argb8888 which is defined in bits of a unit form, and we want to extract the R value:
uint32_t v = *p;
uint8_t r = (v >> 16) & 0xff;
Finally, Figure 3 gives a cheat sheet.

Figure 3. How to compute the address of a pixel.

Now we have covered the essentials, and you can stop reading. The rest is just good to know.

Not everyone has the "right" way up

In the above we have assumed that the image origin is the top-left corner, and rows are stored top-most first. The most notable exception to this is the OpenGL API, which defines image data to be stored bottom-most row first. (Traditionally, the BMP file format also does this.)

Multi-planar formats

In the above, we have talked about single-planar formats. That means that there is only a single two-dimensional array of pixels forming an image. Multi-planar formats use two or more two-dimensional arrays for forming an image.

A simple example with an RGB-image would be to store R channel in the first plane (2D-array) and GB channels in the second plane. Pixels on the first plane have only R value, while pixels on the second plane have G and B values. However, this example is not used in practice.

Common and real use cases for multi-planar images are various YUV color formats. Y channel is stored on the first plane, and UV channels are stored on the second plane, for instance. A benefit of this is that e.g. the UV plane can be sub-sampled - its resolution could be only half of the plane with Y, saving some memory.
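As a sketch, assume a two-plane 4:2:0 layout where a full-resolution Y plane is followed in the same buffer by a half-resolution plane of interleaved U and V (the NV12-style case); the samples for pixel x,y could then be addressed like this:

/* Sketch: both planes in one contiguous buffer, each with its own stride. */
uint8_t *y_plane = data_begin;
uint8_t *uv_plane = data_begin + y_stride * height;

uint8_t Y = y_plane[y * y_stride + x];
uint8_t U = uv_plane[(y / 2) * uv_stride + (x / 2) * 2];
uint8_t V = uv_plane[(y / 2) * uv_stride + (x / 2) * 2 + 1];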

Tiled formats

If you have read about GPUs, you may have heard of tiling or tiled formats (tiled renderer is a different thing). These are special pixel layouts, where an image is not stored row by row but a rectangular block by block. Tiled formats are far too wild and various to explain here, but if you want a taste, take a look at Nouveau's documentation on G80 surface formats.
Now is high time to start discussing what you might want to do, for both student candidates and possible mentors.

Students, have a look at our project idea examples to get a feeling of what kind of projects you could propose. First you will need to contribute at least a small but significant patch to show that you understand the workflow; we have put together some ideas for first tasks.

We also have application instructions for students. Of course, all the pages are reachable from the Wayland GSoC wiki page and also the Wayland organization page.

If you want to become a mentor, please contact me or Kat, the contact details are on the Wayland GSoC wiki page.

Note that students can also apply under the X.Org Foundation organization, since Wayland is within their scope too and they also have other excellent graphics project ideas. You are welcome to submit your Wayland proposals to both projects.
I have recently been occupied with a new project (and been down with a cold all this week), so I have not been much present in the Wayland community. Now I can finally say what I and Emilio have been up to: Waltham! For more information, please see our announcement.
This is continuation to my Wayland desktop-shell post.

My goal was to implement a simple screen locking feature, a similar idea to what xlockmore does for X. In Wayland it is much simpler and more reliable to implement than in X, because the implementation will be in the display server (compositor). While the "lock" itself is in the compositor, also an unlock dialog is required. The unlock dialog usually asks the user to input his password, but I settled for "click the green ball". Screenshots below...

First, a protocol (commit) is needed to drive the compositor locking and unlocking, since the unlock dialog is exported to the desktop-shell client. When the compositor hits the idle timeout, it fades out to black, and then locks itself in the shell plugin. The compositor is woken up by input events, and sends the prepare_lock_surface event to desktop-shell. The client replies with the set_lock_surface request, with the unlock dialog's surface as an argument. Only on getting the surface does the compositor fade in, to have a nice transition to the dialog. The dialog then runs like any other application on screen, and when the user has dismissed it, desktop-shell sends the unlock request to the compositor. On unlock, the compositor brings all windows (surfaces) back to the desktop.

The shell plugin implements screen locking by stealing all the surfaces from the compositor's rendering list. Only the background surface and pointer cursor surfaces are left. This has the side-effect that none of the stolen (hidden) surfaces can be activated nor receive input. The compositor-side surface objects still continue living as usual. New surfaces can be created and they are automatically hidden. Output assignment of the hidden surfaces is set to NULL, which prevents sending any frame events for them, effectively also stopping any animations that might have been running. On unlock, the surfaces are simply put back into the compositor's list, and assigned to outputs.

After the last commit in the screen locking series, you can enjoy automatic screen locking in the Wayland demo compositor:
Normal desktop.




Locked, with the unlock dialog.



Note that locking does not imply a fancy animated screensaver. The black screen is the screensaver ;-)

Thanks to Kristian Høgsberg for his reviews, comments and bug fixes.

This feature is sponsored by Collabora, Ltd.
Continuing on the Wayland screensaver track, I sent a branch for review. The screensaver interface is now fully implemented in both the demo compositor and the demo screensaver. Screenshots below...

The compositor shell plugin of desktop-shell now implements the screensaver interface. This allows a client to register a surface as an idle animation for a given output (monitor). These surfaces remain hidden until the compositor's idle timer triggers, and the compositor fades to black. If there are any screensaver surfaces, the compositor will fade them in, showing the idle animation. The compositor can also be configured to automatically start a screensaver client.

While an idle animation is running, if the shell implements screen locking, the unlock dialog will appear on the first input event, for instance when moving the mouse. The idle animation continues as the background for the unlock screen.

There is another idle timeout running with the idle animation. When that timeout triggers, the compositor fades to black and will cease updating the screen. This also causes properly written animating clients to stop rendering, and we can hit zero CPU usage, even when there is a screensaver active. The compositor will wake from this sleep as usual, and fade in either the desktop directly, or the unlock dialog with the animation in the background.

On returning to the normal desktop, the compositor (the shell plugin, really) will kill the screensaver client if it started it in the first place.

The demo implementation also supports multiple outputs, which is convenient to demonstrate on X. The three Wayland compositor windows are the outputs of a single demo compositor running.

Normal desktop, spread over three outputs, with a few flower clients and a terminal.

The idle animations running on each output with separate state. There is only one screensaver client running.

Idle animation as the background for the unlock screen.
It is time to announce the android-4.0.1_r1.2-b snapshot release of the Wayland on Android project at Collabora! We give you: input support in Weston and a finger-painting demo!

Collabora will have people at GUADEC demoing this on real devices, though not me personally.

Click to see the video!



This release provides ports of the following projects (git repositories, really) to Android 4.0.1 on Samsung Galaxy Nexus:
It also includes some changes to Android internals, and the aggregate for building it all.

This is just a snapshot release of a work in progress, and you cannot do much with it. Everything an end user would have known about Android is still gone.

In Weston, the three device buttons are working, and the touchscreen is working. Unfortunately, the only application really supporting touch devices is simple-touch, but I turned that into a demo that is automatically launched. If you install this release into a Galaxy Nexus device, it will boot into Weston and you can play with simple-touch. The power button is hooked up in Weston to power off the device immediately, so a computer is not necessary to show and exit the demo.

The main advancement compared to my previous posts is that the touchscreen is fully working now. Also, this time I am providing a proper release:
You can get the fastboot tool needed for flashing the images from the Android SDK, I think. I have never used the SDK myself, I have always gone with the full AOSP tree.

Please, if you try this on your device, let me know how it went. If you find problems that I can fix, I might push the fixes to the android-4.0.1_r1.2-b release branches, and update the ChangeLog for this release, but I will not provide new images. Before August I probably won't react, though.

If you look at the histories of the git branches mentioned towards the top, you will find many ugly hacky commits. All commits marked as HACK will be replaced by the proper changes during the course of this project. We are planning to send almost all changes to respective upstream projects, too. The input enablement patch series in Weston needs a rewrite, before it gets upstream.


Thanks to the whole Android team at Collabora for making this happen!
We at Collabora have been working on a new Android build system integration with autotools projects, still based on Androgenizer (git). Now we have our own repo manifest repository, and a tool called anagrman for managing optional feature packages (aggregates). Wayland on Android is one feature package, and the first to become available. We also upgraded to Ice Cream Sandwich 4.0.4_r2.1. Instead of a snapshot release, this particular announcement is about live branches.

Weston (git) was upgraded to upstream as of Sept. 18th, 2012, though there are no user visible changes on Android. This brings the 0.95 protocol, and the new evdev input rework from upstream, which went through some changes since the 4.0.1_r1.2-b Wayland on Android release. Weston now has its GLESv2 renderer separated in code, and shaders are faster and simpler (I have not done any shader benchmarks myself).

Libxkbcommon lost its final dependencies to kbproto and xproto, and our Android build files are upstream. Thanks to Daniel Stone, we can use libxkbcommon straight from upstream.

The new Android build system integration requires you to manually download only anagrman in addition to Android repo. There is no make wayland-aggregate-configure step anymore, all generated Android makefiles are created during the full build. Also androgenizer and wayland-scanner are built automatically as needed. All this is possible using the makefile update feature of GNU Make. If there is a rule that can update a makefile, GNU Make will update the makefile as needed. If any makefiles were updated, Make will then start from scratch, reading in all makefiles again, before continuing to the actual build phase. This causes Make to reload all Android makefiles 2-3 times during the first build. It should also solve any dependency issues of the explicit configure step, like when one project's configure depends on another project's fully built library. A big thank you to Helio Chissini de Castro for doing most of the build system work.

This work is available in two ways:
The ready-made image is configured to launch Weston at boot instead of SurfaceFlinger (the setting is in device/samsung/maguro/system.prop in the source tree). The source code, however, does not start Weston nor SurfaceFlinger automatically. You have to use the commands # adb root and # adb shell to log into the phone, and run one of:
  • # setprop service.compositor surfaceflinger
  • # setprop service.compositor weston
You can also run Weston by just # weston & and start other demo clients manually. Weston will automatically start simple-touch finger drawing demo, and the power button will cause Weston to power off the phone. The available demo apps are: simple-touch, simple-shm, flower, and clickdot. Any GL based demos are not included, since the EGL Wayland platform for clients is still unimplemented.

Even though this is a live release, i.e. not tagged to a specific revision, I do not expect much changes in the near future. We are now researching other ways to enable Wayland on Android and other embedded-like devices.
How would one start implementing support for graphics hardware accelerated Wayland clients on an embedded platform that only has proprietary drivers?

This is a question I have answered more than once recently. Presumably you already have some ways to implement a Wayland compositor, some APIs you can use to composite and get images on screen. You may have wl_shm based clients already working, and now you want hardware rendered clients to work. Where to start?

First I will explain the architecture a bit. There are basically three things related to Wayland graphics:
  • the client side support for graphics hardware acceleration
  • the compositor's support for hardware accelerated clients
  • the compositor's own rendering or compositing, and output to screen

Usually the graphics hardware accelerated applications (clients) use EGL for the window system glue, and GL ES (2) for rendering.  The client side is the EGL Wayland platform, plus wayland-egl API. The EGL Wayland platform means, that you can pass Wayland types as EGL native types, for instance a struct wl_display * as the EGLNativeDisplayType parameter of the eglGetDisplay() function.

The compositor's support for hardware accelerated clients is the server side of Wayland-enabled libEGL. Normally it consists of the EGL_WL_bind_wayland_display extension. For a compositor programmer, this extension allows you to create an EGLImageKHR object from a struct wl_buffer *, and then bind that as a GL texture, so you can composite it.

The compositor's own rendering mechanisms are largely irrelevant to the client support. The only requirement is, that the compositor can effectively use the kinds of buffers the clients send to it. If you turn a wl_buffer object via EGLImageKHR into a GL texture, you would better be compositing with a GL API, naturally. Apart from that, it does not matter what APIs the compositor uses for compositing and displaying.

Now, what do we actually need for supporting hardware accelerated clients?

First, forget about using wl_shm buffers, they are not suitable for hardware accelerated clients. Buffers that GPUs render into are often badly suited for CPU usage, or not directly accessible by the CPU at all. Due to GPU requirements, you likely cannot make a GPU to render into an shm buffer, either. Therefore to get the pixel data into an shm buffer you would need to do a copy, like glReadPixels(). Then you send the shm buffer to the server, and the server needs to copy the pixels again to make them accessible to the GPU for compositing, e.g. by calling glTexImage2D(). That is two copies between CPU and GPU domains, and that is slow. I would say unusably slow. It is far better to not move the pixels into CPU domain at all, and avoid all copying.

Therefore, the most important thing is graphics buffer sharing or passing. Buffer sharing works by creating a handle for a buffer, and passing that handle to another process which then uses the handle to make the GPU access again the same buffer. On your graphics platform, find out:
  • Do such handles exist at all?
  • How do you create a buffer and a handle?
  • How do you direct GL ES rendering into that buffer?
  • What is the handle? Does it contain integers, open file descriptors, or opaque pointers? Integers and file descriptors are not a problem, but you cannot pass (host virtual) pointers from one process to another.
  • How do you create something usable, like an EGLImageKHR or a GL texture, from the handle?
It would be good to test that the buffer passing actually works, too.

Once you know what the handle is, and whether clients can allocate their own buffers (preferred), or whether the compositor must hand out buffers to clients for some obscure security reasons, you can think about how to use the Wayland protocol to pass buffers around. You must invent a new Wayland protocol extension. The extension should allow a client to create a wl_buffer object from the handle. All the standard Wayland interfaces deal with wl_buffer objects, and the server will detect the type of each wl_buffer object when used by a client. Examples of the protocol extension are wl_drm of Mesa, and my experimental android_wlegl.

I recommend you do the first implementation of the protocol extension completely ad hoc. Hack the server to work with your buffer types, and write a custom client that directly uses the protocol without any fancy API like wayland-egl. Once you confirm it works, you can design the real implementation, whether it should be in a wrapper library around the proprietary libEGL or something else.

EGL is the standard interface to implement accelerated Wayland client support and it conveniently hides the details from both servers and clients, but it is not the only way. If you control both server and client code, you can use any other API, or create your own. That is a big if, though.

The key point is buffer sharing; copying will kill your system performance. After you know how to share graphics buffers, your work has only begun. Good luck!

I have a Logitech DiNovo Mini (combined keyboard & touchpad), which at first worked just fine on my Gentoo laptop, an Asus G50V, using the laptop's built-in bluetooth adapter, Bluez major version 4 (4.101-r5 today), and manual connection. Then I tried to connect the DiNovo to other devices, both without and with the USB-bluetooth-dongle that came with the DiNovo. Then I wanted it back on my laptop. There was a time when it worked only if I temporarily removed the battery from the DiNovo. In the end, after several weeks if not months, it did not work anymore, at all. Blindly poking around, I now found how to fix it.

The DiNovo showed a green light, saying it got a connection, but on the laptop, all I could see was the device appearing and very soon disappearing in udev (confirmed with udevadm monitor). I tried to pair it again, many times, and while the pairing seemed to succeed, the device just did not work. I also installed blueman, which indicated the same: when I touched the DiNovo, it connected, and was immediately disconnected. In the system log I got:
bluetoothd[280]: Refusing input device connect: Operation already in progress (114)
bluetoothd[280]: Refusing connection from ##:##:##:##:##:##: setup in progress
(bluetoothd:280): GLib-WARNING **: Invalid file descriptor.
Ok, so is it trying to connect twice for some odd reason? Where is the state kept? Could I manually fix it?

Apparently, I could. In the /var/lib/bluetooth/*/ directory I saw several files that seemed to be about the bluetooth settings on my Gentoo. Not knowing anything about how Bluez works, I looked at the files there, to see if I could spot something suspicious. Luckily they were all plain-enough text files, so I did spot something.

The file /var/lib/bluetooth/*/spd had two lines with my DiNovo's device address on it. The first line was a long one, the second line short. Not knowing what I'm doing, I stopped the bluetooth service, removed the short line, and restarted the bluetooth service. Like magic, the DiNovo started working again, connecting automatically. No errors in the system log anymore, either.

I have not used the DiNovo much after the repair yet, so it remains to be seen if I broke anything, but so far so good. Apparently when I was playing around with the DiNovo, somehow that file got a second entry for the same device, and that caused the malfunction. Whether it is my fault or a bug, I do not know. Googling did not give any helpful hints on solving this, so I am recording this note here, hoping it helps someone.

-- A note, barely readable, scratched with a broken SD-card on the wall of a passageway in the huge, monster crawling dungeon they call the Intternets.
Raspberry Pi is a nice tiny computer with a relatively powerful VideoCore graphics processor, and an ARM core bolted on the side running Linux. Around October 2012 I was bringing Wayland to it, and in November the Weston rpi-backend was merged upstream. Unfortunately, somehow I did not get around to writing about it. In spring 2013 I did a follow-on project on the rpi-backend for the Raspberry Pi Foundation as part of my work for Collabora. We are now really pushing Wayland forward on the Raspberry Pi, and strengthening Collabora's Wayland expertise on all fronts. In the following I will explain what I did and how the new rpi-backend for Weston works in technical terms. If you are more interested in why this was done, I refer you to the excellent post by Daniel Stone: Weston on Raspberry Pi.

Bringing Wayland to Raspberry Pi in 2012

Raspberry Pi has EGL and GL ES 2 support, so the easiest way to bring Wayland was to port Weston. Fortunately, unlike most Android-based devices, Raspberry Pi supports normal Linux distributions, and specifically Raspbian, which is a variant of Debian. That means very standard Linux desktop stuff, and easy to target. Therefore I only had to write a new Raspberry Pi specific backend to Weston. I could not use any existing backend, because the graphics stack supports neither DRM nor GBM, and running on top of an (fbdev) X server would void the whole point. No other usable backends existed at the time.

The proprietary graphics API on RPi is Dispmanx. Dispmanx basically offers a full 2D compositor, but since Weston composited with GL ES 2, I only needed enough Dispmanx to get a full-screen surface for EGL. Half of the patch was just boilerplate to support input and VT handling. All that was fairly easy, but left the Dispmanx API largely unused, not hooking up to the real performance of the VideoCore. Sure, GL ES 2 is accelerated on the VideoCore, too, but it is a much more complex API.

I continued to take more advantage of the hardware compositor Dispmanx exposes. At the time, the way to do that was to implement support for Weston planes. Weston planes were developed for taking advantage of overlay hardware. A backend can take suitable surfaces out from the scenegraph and composite them directly in hardware, bypassing the GL ES 2 renderer of Weston. A major motivation behind it was to offload video display to dedicated hardware, and avoid YUV-RGB color conversion and scaling in GL shaders. Planes allow also the use of hardware cursors.

The hardware compositor on RPi is partially firmware-based. This means that it does not have a constant limit on the number of overlays. Standard PC hardware has at most a few overlays if any, the hardware cursor included. The RPi hardware however offers a lot more. In fact, it is possible to assign all surfaces into overlay elements. That is what I implemented, and in an ideal case (no surface transformations) I managed to put everything into overlay elements, and the GL renderer was left with nothing to do.

The hardware compositor does have its limitations. It can do alpha blending, but it cannot rotate surfaces. It also does have a limit on how many elements it can handle, but the actual number depends on many things. Therefore, I had an automatic fallback to the GL renderer. The Weston plane infrastructure made that very easy.

The fallback had some serious downsides, though. There was no way to synchronize all the overlay elements with the GL rendering, and switches between fallback and overlays caused glitches. What is worse, memory consumption exploded through the roof. We only support wl_shm buffers, which need to be copied into GL textures and Dispmanx resources (hardware buffers). As we would jump between GL and overlays arbitrarily and per surface, and I did not want to copy each attached buffer into both a texture and a resource, I had to keep the wl_shm buffer around and copy it on demand whenever a surface jumped. That means that clients will be double-buffered, as they do not get the buffer back until they send a new one. In Dispmanx, the elements, too, need to be double-buffered to ensure that there cannot be glitches, so they needed two resources per element. In total, that means 2 wl_shm buffers, 1 GL texture, and 2 resources. That is 5 surface-sized buffers for every surface! But it worked.

The first project ended, and time passed. Weston got the pixman-renderer, and the renderer interfaces matured. EGL and GL were decoupled from the Weston core. This made the next project possible.

Introducing the Rpi-renderer in Spring 2013

Since Dispmanx offers a full hardware compositor, it was decided that the GL renderer would be dropped from Weston's rpi-backend. We lose arbitrary surface transformations like rotation, but in all other aspects it is a win: memory usage, glitches, code and APIs, and presumably performance and power consumption. Dispmanx allows scaling, output transforms, and an alpha channel mixed with full-surface alpha. There are no glitches, as we do not jump between GL and overlays anymore. All on-screen elements can be properly synchronized. Clients are able to use single buffering. The Weston renderer API is more complete than the plane API. We do not need to manipulate complex GL state and create vertex buffers, or run the geometry decimation code; we only compute clips, positions, and sizes.

The rpi-backend's plane code had all the essential bits for Dispmanx to implement the rpi-renderer, so lots of the code was already there. It took me less than a week to kick out the GL renderer and have the rpi-renderer show the desktop for the first time. The rest of a month's time was spent on adding features and fixing issues, pretty much.

Graphics Details

Rpi-backend

The rpi-renderer and rpi-backend are tied together, since they both need to do their part in driving the Dispmanx API. The rpi-backend does all the usual stuff like opening evdev input devices and initializing Dispmanx. It configures a single output, and manages its updates. The repaint callback for the output starts a Dispmanx update cycle, calls into the rpi-renderer to "draw" all surfaces, and then submits the update.

Update submission is asynchronous, which means that Dispmanx does a callback in a different thread, when the update is completed and on screen, including the synchronization to vblank. Using a thread is slightly inconvenient, since that does not plug in to Weston's event loop directly. Therefore I use a trick: rpi_flippipe is essentially a pipe, a pair of file descriptors connected together. Write something into one end, and it pops out the other end. The callback rpi_flippipe_update_complete(), which is called by Dispmanx in a different thread, only records the current timestamp and writes it to the pipe. The other end of the pipe has been registered with Weston's event loop, so eventually rpi_flippipe_handler() gets called in the right thread context, and we can actually handle the completion by calling rpi_output_update_complete().
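The idea is simple enough to sketch in a few lines; the following is an illustration of the pattern, not the actual Weston code, and the names are made up:

#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct flippipe {
    int readfd;   /* registered with the compositor's event loop */
    int writefd;  /* written to from the Dispmanx callback thread */
};

static int
flippipe_init(struct flippipe *fp)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    fp->readfd = fds[0];
    fp->writefd = fds[1];
    return 0;
}

/* Called by Dispmanx in another thread when the update is on screen;
 * only records a timestamp and pokes the main thread. */
static void
update_complete_callback(void *data)
{
    struct flippipe *fp = data;
    struct timespec ts;
    uint64_t msecs;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    msecs = (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    write(fp->writefd, &msecs, sizeof msecs);
}

/* Runs in the main thread when the read end becomes readable. */
static void
flippipe_handler(struct flippipe *fp)
{
    uint64_t msecs;

    if (read(fp->readfd, &msecs, sizeof msecs) == sizeof msecs) {
        /* Safe to finish the repaint cycle here, e.g. by calling the
         * equivalent of rpi_output_update_complete(msecs). */
    }
}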

Rpi-renderer

Weston's renderer API is pretty small (a rough sketch of the hook table follows the list):
  • There are hooks for surface create and destroy, so you can track per-surface renderer private state.
  • The attach hook is called when a new buffer is committed to a surface.
  • The flush_damage hook is called only for wl_shm buffers, when the compositor is preparing to composite a surface. That is where e.g. GL texture updates happen in the GL renderer, and not on every commit, just in case the surface is not on screen right now.
  • The surface_set_color callback informs the renderer that this surface will not be getting a buffer, but instead it must be painted with the given color. This is used for effects, like desktop fade-in and fade-out, by having a black full-screen solid color surface whose alpha channel is changed.
  • The repaint_output is the workhorse of a renderer. In Weston core, weston_output_repaint() is called for each output when the output needs to be repainted. That calls into the backend's output repaint callback, which then calls the renderer's hook. The renderer then iterates over all surfaces in a list, painting them according to their state as needed.
  • Finally, the read_pixels hook is for screen capturing.
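Put together as a function-pointer table, the interface looks roughly like this; the exact types and signatures in Weston differ and have changed over time, so treat this purely as an illustration:

#include <stdint.h>

struct weston_surface;
struct weston_output;
struct wl_buffer;
struct pixman_region32;

struct renderer_hooks_sketch {
    /* Per-surface renderer private state. */
    int  (*create_surface)(struct weston_surface *surface);
    void (*destroy_surface)(struct weston_surface *surface);

    /* A new buffer was committed to the surface. */
    void (*attach)(struct weston_surface *surface, struct wl_buffer *buffer);

    /* Called for wl_shm buffers when preparing to composite the surface. */
    void (*flush_damage)(struct weston_surface *surface);

    /* No buffer: paint the surface with a solid color instead. */
    void (*surface_set_color)(struct weston_surface *surface,
                              float r, float g, float b, float a);

    /* Composite all surfaces in the repaint list onto the output. */
    void (*repaint_output)(struct weston_output *output,
                           struct pixman_region32 *output_damage);

    /* Screen capture. */
    int  (*read_pixels)(struct weston_output *output, void *pixels,
                        uint32_t x, uint32_t y,
                        uint32_t width, uint32_t height);
};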
The rpi-renderer per-surface state is struct rpir_surface. Among other things, it contains a handle to a Dispmanx element (essentially an overlay) that shows this surface, and two Dispmanx resources (hardware pixel buffers); the front and the back. To show a picture, a resource is assigned to an element for scanout.

The attach callback basically only grabs a reference to the given wl_shm buffer. When Weston core starts an output repaint cycle, it calls flush_damage, where the buffer contents are copied to the back resource. Damage is tracked, so that in theory, only the changed parts of the buffer are copied. In reality, the implementation of vc_dispmanx_resource_write_data() does not support arbitrary sub-region updates, so we are forced to copy full scanlines with the same stride as the resource was created with. If stride does not match, the resource is reallocated first. Then flush_damage drops the wl_shm buffer reference, allowing the compositor to release the buffer, and the client can continue single-buffered. The pixels are saved in the back resource.

Copying the buffer also involves another quirk. Even though the Dispmanx API allows defining an image with a pre-multiplied alpha channel, and mixing that with a full-surface (element) alpha, a hardware issue causes it to produce wrong results. Therefore we cannot use pre-multiplied alpha, since we want the full-surface alpha to work. This is solved by setting the magic bit 31 of the pixel format argument, which causes vc_dispmanx_resource_write_data() to un-pre-multiply, that is divide by, the alpha channel using the VideoCore. The contents of the resource end up not pre-multiplied, and mixing with full-surface alpha works.

The repaint_output callback first recomputes the output transformation matrix, since Weston core computes it in GL coordinate system, and we use framebuffer coordinates more or less. Then the rpi-renderer iterates over all surfaces in the repaint list. If a surface is completely obscured by opaque surfaces, its Dispmanx element is removed. Otherwise, the element is created as necessary and updated to the new front resource. The element's source and destination pixel rectangles are computed from the surface state, and clipped by the resource and the output size. Also output transformation is taken into account. If the destination rectangle turns out empty, the element is removed, because every existing Dispmanx element requires VideoCore cycles, and it is best to use as few elements as possible. The new state is set to the Dispmanx element.

After all surfaces in the repaint list are handled, rpi_renderer_repaint_output() goes over all other Dispmanx elements on screen, and removes them. This makes sure that a surface that was hidden, and therefore is not in the repaint list, will really get removed from the screen. Then execution returns to the rpi-backend, which submits the whole update in a single batch.

Once the update completes, the rpi-backend calls rpi_renderer_finish_frame(), which releases unneeded Dispmanx resources, and destroys orphaned per-surface state. These operations cannot be done any earlier, since we need to be sure the related Dispmanx elements have really been updated or removed to avoid possible visual glitches.

The rpi-renderer implements surface_set_color by allocating a 1×1 Dispmanx resource, writing the color into that single pixel, and then scaling it to the required size in the element. Dispmanx also offers a screen capturing function, which stores a snapshot of the output into a resource.

Conclusion

While losing some niche features, we gained a lot by pushing all compositing into the VideoCore and the firmware. Memory consumption is now down to a reasonable level of three buffers per surface, or just two if you force single-buffering of Dispmanx elements. Two is on par with Weston's GL renderer on DRM. We leverage the 2D hardware for compositing directly, which should perform better. Glitches and jerks should be gone. You may still be able to cause the compositing to malfunction by opening too many windows, so instead of the compositor becoming slow, you get bad stuff on screen, which is probably the only downside here. "Too many" is perhaps around 20 or more windows visible at the same time, depending.

If the user experience of Weston on Raspberry Pi was smooth earlier, especially compared to X (see the video), it is even smoother now. Just try the desktop zoom (Win+MouseWheel), for instance! Also, my fellow collaborans wrote some new desktop effects for Weston in this project. Should you have a company needing assistance with Wayland, Collabora is here to help.

The code is available in the git branch raspberrypi-dispmanx, and on the Wayland mailing list. As of May 23rd, 2013, the Raspberry Pi specific patches are already merged upstream, and the demo candy patches are waiting for review.

Further related links:
Raspberry Pi Foundation, Wayland preview
Collabora, press release
In the last two (or three?) weeks at Collabora I have been looking into a Wayland protocol extension that would allow accurately timed presentation. Accurate timing is essential for two quite different use cases: video playback with audio/video synchronization, and interactive GUI with just-in-time redrawing. Video playback with A/V sync was our primary goal when we started working on this, and there is Frederic Plourde's first proposal from October 2013. Since then I have realized that also other kinds of applications need timings, especially feedback on when their content updates were shown, and when is the next chance to show an update (vblank). Primarily my re-design started with the aim to improve resizing performance when I got the assignment from Daniel Stone to push Wayland presentation protocol forward. The RFC v2 of Wayland presentation extension is now out for review and discussion.

I looked at various timing and content posting related APIs, like EGL and its extensions, GLX_OML_sync_control, and the new X11 Present and Keith's blog posts about it. I found a couple of things whose purpose I could not understand and asked about them on a few mailing lists. The replies and further pondering resulted in the conclusion that I do not have to support the MSC modulus matching logic, and eglSwapInterval for intervals greater than one could be implemented if anyone really needs it.

I took a more detailed look at X11 Present for XWayland purposes. The Wayland presentation protocol extension I am proposing is not meant to solve all problems in supporting X11 Present in XWayland, but the investigation gave me some faith that with small additional XWayland extensions it could be done. Axel Davy is already implementing Present on core Wayland protocol as far as it is possible anyway, and we had lots of interesting discussions.

I am not going into any details of the RFC v2 proposal here, as the email should contain exhaustive documentation on the design. If no significant flaws are found, the next steps would be to implement this in Weston and see how it works.
Wayland sub-surfaces is a feature that has been brewing for a long long time, and finally it has made it into Wayland core in a recent commit + the Weston commit. The design for sub-surfaces started some time in December 2012, when the task was given to me at Collabora. It went through several RFCs and was finally merged into Weston in May 2013. After that there have been only small changes if any, and sub-surfaces matured (Or was forgotten? I had other things to do.) over several months. Now it is coming out in Wayland 1.4 (plan), but what is it really?

Introduction

The basic visual (and UI) building block in Wayland (the protocol) is a wl_surface. Basically everything on screen is represented as wl_surfaces in the protocol: mouse cursors, windows, icons, etc. A surface gets its content and size by attaching a wl_buffer to it, which is a handle to a pixel container. A surface has many attributes, like the input region: the region of the surface where it can receive input events. Input events, e.g. pointer motion, that happen on the surface but outside of the input region get directed to what is below the surface. The input region can be empty, but it cannot extend beyond the surface dimensions.

It so happens that cursor, shell surface (window), and drag icon are also surface roles. Under a desktop shell, a surface cannot become visible (mapped) unless it has a role, and it fills the requirements of that particular role. For example, a client can set a cursor surface only when it has the pointer focus. Without a role the compositor would not know what to do with a surface. Roles are exclusive: a surface can have only one role at a time. How a role is assigned depends on the protocol for the particular role; there is no generic set_role interface.

A window is a wl_surface with a suitable shell role; there is no separate object type "window" in the protocol. A window being a single wl_surface means that its contents must come from a single wl_buffer at a time. For most applications that is just fine, but there are a few exceptions where it makes things less than optimal when you want to take advantage of hardware acceleration features to the fullest.

The problem

Let us consider a video player in a window. Window decorations and GUI elements are usually rendered in an RGB color format on the CPU. Video usually decodes into some YUV color format. To create one complete wl_buffer for the window, the application must merge these: convert the video into RGB and combine it with the GUI elements. And it has to do that for every single video frame, whether the GUI elements change or not. This causes several performance penalties. If your graphics card is capable of showing YUV-formatted content directly in an overlay, you cannot take advantage of that. If you have video decoding hardware, you probably have to access and copy the produced YUV images with the CPU, while doing a color conversion. Getting CPU access to a hardware rendered buffer may be expensive to begin with, and then color conversion means you are doing a copy. When you finally have that wl_buffer finished and send it to the compositor, the compositor will likely just have to upload it to the GPU again, making another expensive copy. All this hassle and pain is just to get the GUI elements and the video drawn into the same wl_buffer.

Another example is an OpenGL window, or an OpenGL canvas in a window. You definitely do not want to make the GL rendered buffer CPU-accessible, as that can be very expensive. The obvious workaround is to upload your other GUI elements into textures, and combine them with the GL canvas in GL. That could be fairly performant, but it is also very painful to achieve, especially if your toolkit has not been designed to work like that.

A more complex example is a Web browser, where you can have any number of video and GL widgets around the page.

Enter sub-surfaces

Sub-surface is a wl_surface role, that means the surface is an integral sub-part of a window. A sub-surface must always have a parent surface, and the parent surface can have any role. Therefore a window can be constructed from any number of wl_surface objects by choosing one of them to be the main surface which gets a role from the shell, and others are sub-surfaces. Also nesting is allowed, so you can have sub-sub-surfaces etc.

The tree of sub-surfaces starting from the main surface defines a window. The application sets the sub-surface's position on the parent surface, and the compositor will keep the sub-surface glued to the parent. The compositor does not clip sub-surfaces to the parent surface. This means you could implement decorations as four surfaces around the content surface, and compared to one big surface for decorations, you avoid wasting memory for the part that will always be behind the content surface. (This approach may have a visual downside, though.) It also means, that for window management purposes, the size of the window comes from the union of the whole (sub-)surface tree.

In the windowed video player example, the video can be put on a wl_surface of its own, and the decorations into another. If there are sub-titles on top of the video, that could be a third wl_surface. If the compositor accepts the YUV color format the video decoder produces, you can decode straight into a wl_buffer's storage, and attach that wl_buffer to the wl_surface. No more copying or color conversions in the application. When the compositor gets the YUV buffer, it could use GLSL shaders to convert it into RGBA while it composites, or put the buffer into a hardware overlay directly. In the overlay case, the data produced by the (hardware) video decoder gets scanned out on the graphics chip zero-copy! After decoding, the data is not copied or converted even once, which is the optimal path. Of course, in practice there are many implementation details to get right before reaching the optimal path.
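As a rough sketch, assuming the compositor advertises the wl_subcompositor global, the video player could set its video area up like this; the struct and helper names are illustrative, and shell setup and the decoder are left out:

#include <wayland-client.h>

struct video_view {
    struct wl_surface *surface;
    struct wl_subsurface *subsurface;
};

static void
create_video_view(struct video_view *view,
                  struct wl_compositor *compositor,
                  struct wl_subcompositor *subcompositor,
                  struct wl_surface *main_surface,
                  int32_t x, int32_t y)
{
    view->surface = wl_compositor_create_surface(compositor);

    /* Give the new surface the sub-surface role, parented to the main
     * (decoration/GUI) surface. */
    view->subsurface = wl_subcompositor_get_subsurface(subcompositor,
                                                       view->surface,
                                                       main_surface);

    /* Position is relative to the parent; the compositor keeps the
     * sub-surface glued there. */
    wl_subsurface_set_position(view->subsurface, x, y);
}

/* Later, the decoder attaches YUV wl_buffers to view->surface and
 * commits; the compositor may scan them out directly in an overlay. */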

Atomicity

Updates to one wl_surface are made atomic with the commit request. A tree of sub-surfaces needs to be updated atomically, too. This is important especially in resizing a window.

A sub-surface's commit request acts specially, when the sub-surface is in synchronized mode. A commit on the sub-wl_surface does not immediately apply the pending surface state, but instead the pending state is cached. The cache is just another copy of the surface state, in addition to the pending and current sets of state. The cached state gets applied when the parent wl_surface gets new state applied (Note: not straight on the parent surface's commit, but when it gets new state applied.) Relying on the cache mechanism, an application can submit new state for the whole tree of surfaces, and then apply it all with a single request: commit on the main surface.
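In code, the cached-commit behaviour could be used roughly like this, assuming the sub-surface is in the default synchronized mode and the buffers already exist:

#include <stdint.h>
#include <wayland-client.h>

static void
update_window_atomically(struct wl_surface *main_surface,
                         struct wl_buffer *main_buffer,
                         struct wl_surface *sub_surface,
                         struct wl_buffer *sub_buffer)
{
    /* New state for the sub-surface: this commit is only cached, nothing
     * appears on screen yet. */
    wl_surface_attach(sub_surface, sub_buffer, 0, 0);
    wl_surface_damage(sub_surface, 0, 0, INT32_MAX, INT32_MAX);
    wl_surface_commit(sub_surface);

    /* New state for the main surface: when this state is applied, the
     * cached sub-surface state is applied with it, so the whole tree
     * updates atomically. */
    wl_surface_attach(main_surface, main_buffer, 0, 0);
    wl_surface_damage(main_surface, 0, 0, INT32_MAX, INT32_MAX);
    wl_surface_commit(main_surface);
}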

Input handling considerations

When a window has sub-surfaces completely overlapping with its main surface, it is often easiest to set the input region of all sub-surfaces to empty. This will cause all input events to be reported on the main surface, and in the main surface coordinates. Otherwise the input events on a sub-surface are reported in the sub-surface's coordinates.
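A small sketch of that suggestion, assuming the wl_compositor global is at hand; the helper name is made up:

#include <wayland-client.h>

static void
disable_input_on_subsurface(struct wl_compositor *compositor,
                            struct wl_surface *sub_surface)
{
    /* A freshly created region is empty; setting it as the input region
     * makes input events pass through to whatever is below. */
    struct wl_region *empty = wl_compositor_create_region(compositor);

    wl_surface_set_input_region(sub_surface, empty);
    wl_region_destroy(empty);
    /* Takes effect on the next commit of sub_surface. */
}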

Independent application sub-modules

A use case that strongly affected the design of the sub-surface protocol was application plugin level embedding. An application creates a wl_surface, turns it into a sub-surface, and gives control of that wl_surface to a sub-module or a plugin.

Let us say the plugin is a video sink running in its own thread, and the host application is a Web browser. The browser initializes the video sink and gives it the wl_surface to play on. The video sink decodes the video and pushes frames to the wl_surface. To avoid waking up the browser for every video frame and requiring it to commit on its main surface to let each video frame become visible, the browser can set the sub-surface to desynchronized mode. In desynchronized mode, commits on the sub-surface apply the pending state directly, just like without the sub-surface role. The video sink can run on its own. The browser is still able to control the sub-surface's position on the main surface, glitch-free.

However, resizing gets more complicated, which was also a cause for some criticism. When the browser decides it needs to resize the sub-surface the video sink is using, it sets the sub-surface to synchronized mode temporarily, which means the video on screen stops updating, as all surface state updates now go into the cache. Then the browser signals the new size to the video sink, and the sink acknowledges when it has committed the first buffer with the new size. In the meantime, the browser has repainted its other window parts as needed, and then commits on its main surface. This produces an atomic window update on screen. Finally the browser sets the sub-surface back to the free-running mode. If all goes fast, the result is a glitch-free resize without missing a frame. If things take time, the user still sees a window resize without any flickers, but the video content may freeze for a moment.
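The browser-side part of that dance could look roughly like the following sketch; signalling the new size to the sink and waiting for its acknowledgement are application-specific and only hinted at in the comments:

#include <wayland-client.h>

static void
resize_video_subsurface(struct wl_subsurface *video_sub,
                        struct wl_surface *main_surface)
{
    /* Freeze the video: further commits from the sink are now cached. */
    wl_subsurface_set_sync(video_sub);

    /* ... tell the video sink the new size, wait until it has committed
     * a buffer with that size, and repaint the browser's own parts ... */

    /* One commit on the main surface applies everything at once. */
    wl_surface_commit(main_surface);

    /* Let the video run freely again. */
    wl_subsurface_set_desync(video_sub);
}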

Multiple input handlers

It is possible that sub-modules want to handle input on their wl_surfaces, which happen to be sub-surfaces. Sub-modules may even create new wl_surfaces, regardless of whether they will be part of the sub-surface tree of a window or not. In such cases, there are a couple of catches.

The first catch is that when input focus moves to a sub-surface, the input events are given in that surface's coordinates, as said before.

The bigger catch is how input actually targets surfaces in the client side code. Actual input events for keyboards and pointer devices do not carry the target wl_surface as a parameter. The targeted surface is given by enter events, wl_pointer.enter(surface) for instance. In C code, it means a callback with the following signature gets called:
void pointer_enter(void *data, struct wl_pointer *wl_pointer, uint32_t serial, struct wl_surface *surface, wl_fixed_t surface_x, wl_fixed_t surface_y)
You get a struct wl_surface* saying which surface the following pointer events will target. I assume, that toolkits will call wl_surface_get_user_data(surface) to get a pointer to their internal structure, and then continue with that.

What if the wl_surface is not created by the toolkit to begin with? What if the surface was created by a sub-module, or a sub-module unexpectedly set a non-empty input region on a sub-surface? Then get_user_data will give you a pointer which points to something else than you thought, and the application likely crashes.

When a toolkit gets an enter event for a surface it does not know about, it must not try to use the user_data pointer. I see two obvious ways to detect such surfaces: maintain a hash table of known wl_surface pointers, or use a magic value in the beginning of the struct used as user_data. Neither is nice, but I do not see a way around it, and this is not limited to sub-surfaces or sub-sub-surfaces. Enter events may refer to any wl_surface objects created through the Wayland connection.
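A defensive enter handler along the lines of the first option might look like this; the fixed-size registry and the struct widget type are illustrative assumptions, not part of any toolkit:

#include <stdbool.h>
#include <wayland-client.h>

struct widget;

#define MAX_KNOWN_SURFACES 64

static struct wl_surface *known_surfaces[MAX_KNOWN_SURFACES];
static int known_count;

/* Call this for every wl_surface the toolkit itself creates. */
static void
register_surface(struct wl_surface *surface)
{
    if (known_count < MAX_KNOWN_SURFACES)
        known_surfaces[known_count++] = surface;
}

static bool
surface_is_known(struct wl_surface *surface)
{
    for (int i = 0; i < known_count; i++)
        if (known_surfaces[i] == surface)
            return true;
    return false;
}

static void
pointer_enter(void *data, struct wl_pointer *wl_pointer, uint32_t serial,
              struct wl_surface *surface,
              wl_fixed_t surface_x, wl_fixed_t surface_y)
{
    (void)data; (void)wl_pointer; (void)serial;
    (void)surface_x; (void)surface_y;

    if (!surface_is_known(surface))
        return; /* Not ours: do not touch its user_data. */

    struct widget *widget = wl_surface_get_user_data(surface);
    /* ... deliver the enter event to the widget ... */
    (void)widget;
}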

Therefore I would propose the following:
  • Always be prepared to receive an unknown wl_surface on enter and similar events.
  • When writing sub-modules and plugin interfaces, specify whether input is allowed, and whose responsibility it is to set the input region to empty.

Out of scope

When I started designing the sub-surface protocol, a huge question was what to leave out of it. The following are not provided by sub-surfaces:
  • Embedding content from other Wayland clients. The sub-surface extension does not implement any "foreign surface" interfaces, or anything like what X allows by just taking the Window XID and passing it to another client to use. The current consensus seems to be that this should be solved by implementing a mini-compositor in the hosting application.
  • Clipping or scaling. The buffer you attach to a sub-surface will decide the size of the sub-surface. There is another extension coming for clipping and scaling.
  • Any kind of message passing between application components. That is better solved in application specific ways.

Summary

Sub-surfaces are intended for special cases, where you need to build a window from several buffers that are composited together, to make efficient use of the hardware resources. They are not meant for widgets in general, nor for pushing parts of application rendering to the compositor. Sub-surfaces are also not meant for things that are not integral parts of a window, like tooltips, menus, or drop-down boxes. These "transient" surface types should be offered by the shell protocol.

Thanks to Collabora, reviewers on wayland-devel@ and in IRC, my work colleagues, and everyone who has helped me with this. Special thanks to Giulio Camuffo for testing the decorations in 4 sub-surfaces use case. I hope I didn't forget anyone.
April 20, 2017

Since the hardware very much matters, this is going to be divided into a few parts: the common steps and the hardware-specific ones.

Common steps

mkdir /opt/android
repo init -u https://android.googlesource.com/platform/manifest -b android-7.1.1_r28
cd /opt/android/.repo
git clone git://git.collabora.com/git/user/robertfoss/android_manifest.git local_manifests -b etnaviv-android
repo sync -j75

mkdir /opt/imx6_android
cd /opt/imx6_android
git clone git://git.collabora.com/git/user/robertfoss/linux.git -b imx_rdu2_v4.11-rc3

# The mkimage tool is used even if you're not
# using u-boot as a bootloader
sudo apt install u-boot-tools

# Fetch Kconfig, bootloaders and some scripts
git clone git://git.collabora.com/git/user/robertfoss/rdu2.git .

# This will destroy all data …
April 19, 2017

There exists official documentation for how to create a custom boot animation, but unfortunately it is lacking in actual examples.

So this guide is a bit more hands on.

Structure of bootanimation.zip

Without covering too much of the same ground as the documentation, let's have a quick look at what is in a simple bootanimation.zip.

$ ls -la bootanimation
total 28
drwxr-xr-x 4 hottuna hottuna 4096 Apr 19 22:39 .
drwxr-xr-x 8 hottuna hottuna 4096 Apr 19 22:39 ..
-rw-r--r-- 1 hottuna hottuna   92 Apr 19 15:21 desc.txt
drwxr-xr-x 2 hottuna hottuna 4096 Apr 19 12:44 part0
drwxr-xr-x 2 hottuna hottuna 4096 Apr 19 12:45 part1

$ cat bootanimation/desc.txt 
1920 1080 30         # WIDTH HEIGHT FPS
c 5 15 part0 …

We, at Igalia, have been involved in enabling the ARB_gpu_shader_fp64 extension on different Intel generations: first Broadwell and later, then Haswell. Now IvyBridge support is finished and has landed in Mesa’s master branch.

This feature was the last one to expose OpenGL 4.0 in Intel IvyBridge with the open-source Mesa driver. This is a big achievement for an old hardware generation (IvyBridge was released in 2012), which allows users to run OpenGL 4.0 games/apps on GNU/Linux without overriding the supported version with a Mesa-specific environment variable.

More good news… ARB_vertex_attrib_64bit support has landed too, meaning that we are exposing OpenGL 4.2 on Intel Ivybridge!

Technical details

Diving a little bit into technical details (skip this if you are not interested in those)…

This work stands on the shoulders of the Intel Haswell support for ARB_gpu_shader_fp64. The latter introduced support for double floating-point (DF) data types on both the scalar and vec4 backends, which is, in general, very similar to Ivybridge. If you are interested in the technical details about adding ARB_gpu_shader_fp64 to Intel GPUs, see Iago’s talk at last XDC (slides here).

Ivybridge was the first Intel generation that supported double floating-point data types natively. The most important difference between Ivybridge and Haswell is that both the execution size and the regioning parameters (stride and width) are in terms of 32 bits, so we need to double both the regioning parameters and the execution size when emitting DF instructions.

But this is not the only annoyance; there are other quite relevant ones:

  • We emit a scalar DF instruction with a maximum execution size of 4 (doubled later to 8) to avoid hitting the gen7 instruction decompression bug -present also in Haswell- that makes the hardware read 2 consecutive GRFs regardless of the vertical stride. This is especially annoying when reading DF scalars, because the stride is zero -we just want to read data from one GRF- and this bug would make us read the next GRF too; furthermore, the hardware applies the same channel enable signals to both halves of the compressed instruction, which will be just wrong under non-uniform control flow if force_writemask_all is disabled. However, there is also a physical limitation related to this when using Align16 access mode: SIMD8 is not allowed for DF operations. Also, in order to make DF instructions work under non-uniform control flow, we use NibCtrl to choose the proper flags of the execution mask.

  • Conversions from 32-bit data types to double (and vice versa) are quite special. Each 32-bit source element should be aligned to 64 bits, so we need to apply a stride to the original data in order to keep this alignment. This is because the FPU internals cannot do the conversion if the data is not aligned to the size of the bigger type. A similar thing happens when converting doubles to 32-bit data types: the output elements need to be 64-bit aligned too.

  • When splitting each DF instruction into two (or more) instructions with an exec_size of 4 at most, it is sometimes not so trivial to do and needs temporary registers to save the intermediate results before merging them into the real destination.

Due to these things, we needed to improve the d2x lowering pass (now called lower_conversions), which fixes the aforementioned conversions from double floating-point data to 32-bit data types, add some specific fixes in the generator, and add code in the validator to detect invalid cases, among other things.

In summary, although Ivybridge is very similar to Haswell, the 64-bit floating point support is quite special and it needs specific code.

Acknowledgements

I would like to thank Matt Turner and Francisco Jerez from Intel for their insightful reviews, sometimes spotting problems that we did not foresee, and for their contributions to the patch series. Also, I would like to thank Juan for his contributions to make this support happen, and Igalia for allowing me to work on this amazing open-source project.

Igalia

April 17, 2017

First, a meta update: I’ve moved away from my ancient Livejournal for hosting this blog. LJ recently updated their TOS with a bunch of moderately scary stuff about following Russian law, including requirements to register as a media outlet if your blog gets too popular. While Russian internet censorship probably wouldn’t ever apply to this blog, and it’s unlikely to become popular enough to be a media outlet, I don’t want to support them after this move.

I landed the VC4 V3D fencing code last week. This allows drivers like tinydrm (for the little SPI-attached panels for Raspberry Pi) or PL111 (for my bcm911360 phone) to correctly synchronize display pageflipping to V3D rendering. In the process of writing my V3D code, I found a bug and my reviewers found a cleanup, which I have also submitted for msm and etnaviv.

I also worked on fixing up the NEON tiling code for Mesa 17.1. With what I had landed recently, we weren’t doing runtime detection of NEON being present – I just used NEON if we were on an ARMv7 build. This works for Debian or Fedora’s builds that would support Pi 2/3, but not for Raspbian’s ARMv6 builds that support Raspberry Pi 0-3. I’ve got runtime detection now, and I’m just working on cleaning up all the ARMv6 and ARMv8 cases.

Boris submitted a patch for HDMI runtime power management. This will reduce power consumption (by an amount we haven’t measured) when the HDMI display is off on VC4, but will also result in being able to switch from HDMI to SDTV and back, which was previously broken. I’ll be testing and hopefully landing it this week.

I received an entertaining Mesa patch from the developers of Exagear: they wrote SSE code for V3D’s tiling format, to improve the performance of texture uploads in their x86 emulation environment on Raspberry Pi. We’re still doing patch review (their SSE code doesn’t do runtime SSE detection yet), but it should be ready soon. Has anyone experimented with qemu-user for x86 emulation on Raspberry Pis?

Other projects last week: I reviewed and landed the STM32 DRM display driver, tested Meson upstream fixes for the X Server’s build system, reviewed a small performance fix for Mesa’s GL name lookups, and reviewed Gustavo’s patch series to clean up cursor handling on drivers like vc4.

Finally, something that’s not quite vc4-related but that I wanted to highlight: Daniel Vetter posted a patch to document the new code of conduct that applies to DRM kernel development. Many have been pushing for better behavior on our mailing lists (a process in which I’ve been an occasional participant), but until now it’s always been in the form of polite requests. The new patch gives us a bit of a stick for when people refuse to change their behavior. The response was clear, with 22 Reviewed-bys and Acked-bys flooding in in the next few days. This is a huge step forward for our community, and hopefully the patch will land this week.

April 11, 2017

The pl111 driver might be the most satisfying driver submission yet. Not because I’m desperate to see it in tree; I don’t actually have that hardware. Not because it’s bringing really exciting new capabilities. But, from Eric’s v5 submission:

v2: Nearly complete rewrite by anholt, cutting 23 of the code thanks to DRM core’s excellent new helpers.

and a follow-up review:

I must say the driver is really slim and readable with all the new helpers from DRM, good job all who refactored the DRM support for simple framebuffer systems.

The DRM community really has come a long, long, way. Great to see it so thriving and healthy that people are actively dusting off ancient drivers which never got merged, deleting most of them in the process, and getting them in just because the process works so well.

April 10, 2017

On Friday, after consulting with the other freedesktop.org admins, I pushed a change to the freedesktop.org wiki, adding a Code of Conduct, based on the widely-used Contributor Covenant. In doing this, we join pretty much every other large open source project on the planet, with the exception of the Linux kernel, a magnificent anti-pattern. From this point on, all projects hosted on freedesktop.org are subject to this CoC.

why so broad?

fd.o is a notoriously loose collective of communities, who largely run their own affairs. However, freedesktop.org as a project is responsible for the content on our site: mailing list archives, bug tracking systems, website, etc. We’ve already had to intervene to remove legally problematic content from our hosting platforms; ultimately, it is our responsibility.

The culture of our member projects reflects on us as a wider organisation, and the problems of abusive and bullying behaviour weren’t solving themselves. In some specific cases we looked at, we were told directly by senior figures in the project that the lack of a defined fd.o-wide CoC made it harder for them to enforce it themselves.

In the end, the only course of action was completely clear: that we take the same approach to unacceptable behaviour as we do to legally-unacceptable content. Enforcing it across the platform gives everyone complete clarity of what’s required (i.e. behaving like reasonable human beings).

but the honeytrap / bad code

The notion that codes of conduct are used as a kind of submarine device to saddle communities with terrible code, and run completely excellent people out of the community for literally no reason, has been thoroughly debunked over the years that codes of conduct have been implemented. I don’t plan to give these arguments any time at all.

what now?

The conduct mailing list now exists, for confidential reports of any CoC violations. This is currently only manned by fd.o admins. We have, however, been in touch with some of the larger member projects, inviting them to help deal with conduct enforcement in their own projects. This process will take time, but if you’re interested in doing this for your project, please get in touch.

And hopefully, people continue to build healthy communities producing excellent code.

About a week ago there were two articles on LWN, the first covering memory management patch review and the second covering the trouble with making review happen. The take-away from these two articles seems to be that review is hard, there’s a constant lack of capable and willing reviewers, and this has been the state of review since forever. I’d like to counterpose this with our experiences in the graphics subsystem, where we’ve rolled out a well-working review process for the Intel driver, the core subsystem and now the co-maintained small driver efforts, with success and not all that much pain.

tl;dr: require review, no exceptions, but document your expectations

Aside: This is written with a kernel focus, from the point of view of a maintainer or group of maintainers trying to establish review within their subsystem. But the principles really work anywhere.

Require Review

When review doesn’t happen, that generally means no one regards it as important enough. You can try to improve the situation by highlighting review work more, and giving less focus to top committer stats. But that only goes so far; in the end, when you want to make review happen, the one way to get there is to require it.

Of course, if that then results in massive screaming, then maybe you need to start with improving the recognition of review and valuing it more. Trying to put a new process into place over persistent resistance is not going to work. But as long as there’s general agreement that review is good, this is the easy part.

No Exceptions

The trouble is that there are a few really easy ways to torpedo reviews before you even started, and they’re all around special privileges and exceptions. From one of the LWN articles:

… requiring reviews might be fair, but there should be one exception: when developers modify their own code.

Another similar exception is often demanded by maintainers for applying their own patches to code they maintain - in the Linux kernel only about 25% of all maintainer patches have any kind of review tag attached when they land. This is in contrast to other contributors, who always have to get past at least their direct maintainer to get a patch applied.

There’s a few reasons why having exceptions for the original developer of some code, or a maintainer of a subsystem, is a really bad idea:

  • Doing review (or at least full review) only for new contributors and people external to the subsystem makes it look like it’s just an elaborate hazing ritual, until you’re welcomed into the inner cabal. That tends to not go down too well with new folks, and not many want to help keep such a system running by actively contributing review. End result is that the pool of reviewers will stay limited to the old guard.

  • Forcing even established contributors to go through review is a great opportunity for them to teach the art of a good review to someone new. Even when you write perfect code, eventually you’ll be gone, and then someone else needs to understand your code. Not subjecting your own patches to review by others and new contributors drops one of the best mentoring opportunities we have in open source on the floor.

  • And really, unicorns who always write perfect code don’t exist. At least I haven’t seen them yet …

On the flip side, requiring review from all your main contributors is a really easy way to kickstart a working review economy: instantly you have both a big demand for review and capable reviewers who are very much willing to trade a bit of review for getting reviews on their own patches.

Another easy pitfall is maintainers who demand unconditional NAck rights for the code they maintain, sometimes spiced up by claiming they don’t even need to provide reasons for the rejection. Of course more experienced people know more about the pitfalls of a code base, and hence are more likely to find serious defects in a change. But most often these rejections aren’t about clear bugs, but about design dogmas once established (and perhaps no longer valid), or just plain personal style preferences. Again, this is a great way to prevent review from happening:

  • Anyone without NAck privileges can only do second-class review, and their review is then of course much less valued. Which means it won’t happen. Note this isn’t about experience (see above, review from new folks can still be useful), but about who has been around for longer, or who has the commit powers. Valuing only review from the old guard makes sure you won’t train new reviewers.

  • It also curbs a working review economy, since non-maintainers can’t unblock patches. If only maintainers can provide real review, then you can’t trade reviews. Worse, non-maintainer reviewers might direct the author into a direction that the maintainer doesn’t approve of, wasting everyone’s time.

And again, I haven’t seen unicorns who write perfect code yet, nor have I seen someone whose review feedback was consistently impeccable.

But Document your Expectations

Training reviewers through direct mentoring is great, but it doesn’t scale. Document what you expect from a review as much as possible. This includes everything from coding style, to how much and in which detail code correctness should be checked. But also related things like documentation, test-cases, and process details on how exactly, when and where review is happening.

And like always, executable documentation is much better, hence try to script as much as possible. That’s where build-bots, CI bots, coding style bots, and all these things come in - they free the carbon-based reviewers from wasting time on the easy things and let them instead concentrate on the harder parts of review, like code design and overall architecture, and how to best get there from the current code base. But please make sure your scripting and automated testing is of high quality, because if the results need interpretation by someone experienced you haven’t gained anything. The kernel’s checkpatch.pl coding style checker is a pretty bad example here, since it’s widely accepted that it is too opinionated and its suggestions can’t be blindly followed.

As some examples, we have the dim inglorious maintainer scripts and some fairly extensive documentation on what is expected from reviewers for Intel graphics driver patches. Contrast that with the comparatively lax review guidelines for small drivers in drm-misc. At Intel we’ve also done internal trainings on review best practices and guidelines. Another big thing we’re working on on the automation front is CI crunching through patchwork series to properly regression-test new patches before they land.

April 09, 2017

Audio Shop is a simple script that I cobbled together that gets you started with mangling image data as if it was audio data.

The script wraps 3 individually excellent tools; ffmpeg, ImageMagick and SoX.


The way it works is by first converting an image to a raw format like rgb or yuv. This is done to prevent the audio editor from destroying the structure of (relatively) complex formats like jpg, png or gif.


If converting to a raw format is the first step, the second step is importing the raw image data into the audio editor. To do this in a way you can expect good results from, the raw format should use a bit-depth that your image editor can use. For example RGB, where …

April 06, 2017

Warning! This is not a directly Linux/Tech related blog post (it is not the only thing I care about in this world :).

One thing that has been on my mind for a while is the state of journalism. The quality of journalism seems to have
been declining over the last decade and I think it is clear that the new internet driven expectation that news content
is free for the consumer is a big part of the explanation. We all know that newspapers and TV news teams have seen their
staff cut as advertising revenue has not been strong enough to keep staffing up. And in my opinion advertising is in itself
a horrible way to finance something like news and information, as it drives a lot of unwanted behaviour, both in terms of
avoiding critical journalism that might drive away advertisers and in terms of an intensive drive for ‘clicks’ per news article that often
comes at the expense of accuracy and makes news very scandal driven.

So I have come to believe over the last few years that if we want to see quality journalism and a healthy democracy we need
to move away from free news content to accepting that we get what we pay for and start paying for our news again. This is true
both in terms of mainstream media, but also in terms of topical media like tech media.

So as a start I began paying for some of my most-used news sites last year. I am now a paying subscriber to the Economist, which is one
of the best sources of quality news in my opinion and on the tech side I am a paying subscriber to Phoronix (I am also a paying subscriber to LWN through work). Anyway, I feel strongly enough about this to write this blog and hope that other people reading it agree with my thinking here and start paying for the content you enjoy, be that through subscriptions or patreons or similar. And maybe we can be part of a process to change the expectation and understanding of the value of well funded independent media. Lets help make news something that is made to help inform us as readers and not something that is made to help someone sell something to us.

So, as most of you probably know, Mark Shuttleworth just announced that they will be switching to GNOME 3 and Wayland again for Ubuntu. On behalf of the Red Hat Desktop and Fedora teams, I would like to welcome them and say that we look forward to continuing to work with great Canonical and Ubuntu people like Allison Lortie and Robert Ancell on projects of shared interest around GNOME, Wayland and hopefully Flatpak.

It is worth mentioning that even as we have been competing with Unity and Ubuntu, we have also been collaborating with them, most recently working with them to integrate the features they wanted from GNOME Software, like the user reviews. Of course, now that we share a bigger set of technologies, collaboration will be even easier.

I am personally happy to see this convergence of efforts happening, because I have for a long time felt that the general level of investment in the Linux desktop has not been great enough to justify the plethora of Linux desktops out there. Now that Canonical, Endless, Red Hat and Suse again share one desktop technology stack, with consulting companies like Centricular, CodeThink, Collabora and Igalia helping push parts of the stack forward, we are at least all pulling in the same direction.
This change should also make life easier for ISVs, who now have a clearer target if they want to try to integrate their UI with the Linux desktop, as ‘the linux desktop’ becomes a more meaningful term with this change.

And to state the obvious, we will continue our effort around Fedora Workstation to continue to lead on innovation and engineering and setting the direction for where Desktop Linux goes in the future.

April 04, 2017

It’s time for a long-overdue blogpost about the status of Tanglu. Tanglu is a Debian derivative, started in early 2013 when the systemd debate at Debian was still hot. It was formed by a few people wanting to create a Debian derivative for workstations with a time-based release schedule, using and showcasing new technologies (which include systemd, but also bundling systems and other things) and built in the open with a community, using infrastructure similar to Debian’s. Tanglu is designed explicitly to complement Debian and not to compete with it on all devices.

Tanglu has achieved a lot of great things. We were the first Debian derivative to adopt systemd and with the help of our contributors we could kill a few nasty issues affecting it and Debian before it ended up becoming default in Debian Jessie. We also started to use the Calamares installer relatively early, bringing a modern installation experience additionally to the traditional debian-installer. We performed the usrmerge early, uncovering a few more issues which were fed back into Debian to be resolved (while workarounds were added to Tanglu). We also briefly explored switching from initramfs-tools to Dracut, but this release goal was dropped due to issues (but might be revived later). A lot of other less-impactful changes happened as well, borrowing a lot of useful ideas and code from Ubuntu (kudos to them!).

On the infrastructure side, we set up the Debian Archive Kit (dak), managing to find a couple of issues (mostly hardcoded assumptions about Debian) and reporting them back to make it easier to use dak for distributions which aren’t Debian. We explored using fedmsg for our infrastructure, went through a long and painful iteration of build systems (buildbot -> Jenkins -> Debile) before finally settling on Debile, and added a set of our own custom tools to collect archive QA information and present it to our developers in an easy-to-digest way. Except for wanna-build, Tanglu hosts an almost-complete clone of the basic Debian archive management tools.

During the past year, however, the project’s progress slowed down significantly. For this, mostly I am to blame. One of the biggest challenges for a young project is to attract new developers and members and keep them engaged. A lot of the people coming to Tanglu who were interested in contributing were unfortunately not packagers and sometimes not developers at all, and we didn’t have the manpower to individually mentor these people and teach them the necessary skills. People asking for tasks were usually asked where their interests lay and what they would like to do, so we could give them a useful task. This sounds great in principle, but in practice it is actually not very helpful. A curated list of “junior jobs” is a much better starting point. We also invested almost zero time in making our project known and creating the necessary “buzz” and excitement that’s actually needed to sustain a project like this. Doing more in the advertisement domain and the “help newcomers” area is a high-priority issue in the Tanglu bugtracker, which to this day is still open. Doing good work alone isn’t enough; talking about it is of crucial importance. That is something I knew, but whose impact I didn’t realize for quite a while. As strange as it sounds, investing in the tech alone isn’t enough; community building is of equal importance.

Regardless of that, Tanglu has members working on the project, but way too few to manage a project of this magnitude (getting package transitions migrated alone is a large task that requires quite some time while at the same time being incredibly boring :P). A lot of our current developers can only invest small amounts of time into the project because they have a lot of other projects as well.

The other reason why Tanglu has problems is that too much is centralized on myself. That is a problem I have wanted to rectify for a long time, but whenever a task wasn’t getting done in Tanglu because no people were available to do it, I completed it myself. This essentially increased the project’s dependency on me as a single person, giving it a really low bus factor. It not only centralizes power in one person (which actually isn’t a problem as long as that person is available enough to perform tasks when asked), it also centralizes knowledge of how to run services and how to do things. And if you want to give up power, people will first need the knowledge of how to perform the specific task (which they will never gain if there’s always that one guy doing it). I still haven’t found a great way to solve this – it’s a problem that essentially solves itself once the project is big enough, but until then the only way to counter it slightly is to write lots of documentation.

Last year I had way less time to work on Tanglu than the project deserves. I also started to work for Purism on their PureOS Debian derivative (which is heavily influenced by some of the choices we made for Tanglu, but with a different focus – that’s probably something for another blogpost). A lot of the stuff I do for Purism duplicates the work I do on Tanglu, and also takes away time I have for the project. Additionally I need to invest a lot more time into other projects such as AppStream, and a lot of random other stuff that just needs continuous maintenance and discussion (AppStream especially eats up a lot of time since it became really popular in a lot of places). There is also my MSc thesis in neuroscience that requires attention (and is actually in focus most of the time). All in all, I can’t split myself, and KDE’s cloning machine remains broken, so I can’t even use that ;-). There is also a personal hard limit on how many projects I can handle, and exceeding it long-term is not very healthy: in those cases I try to satisfy all projects and in the end do not focus enough on any of them, which leaves me with a lot of half-baked stuff (which helps nobody, and most importantly makes me lose the fun, energy and interest to work on it).

Good news everyone! (sort of)

So, this all sounded overly negative, so where does it leave Tanglu? The fact is, I cannot commit the crazy amounts of time to it that I did in 2013. But I love the project, and I actually do have some time I can put into it. My work at Purism overlaps with Tanglu, so Tanglu can actually benefit from the software I develop for them, maybe creating a synergy effect between PureOS and Tanglu. Tanglu is also important to me as a testing environment for future ideas (be it in infrastructure or in the “make bundling nice!” department).

So, what actually is the way forward? First, maybe I have the chance to find a few people willing to work on tasks in Tanglu. It’s a fun project, and I learned a lot while working on it. Tanglu also possesses some unique properties few other Debian derivatives have, like being built completely from source (allowing us things like swapping out core components, compiling with more hardening flags, switching to newer KDE Plasma and GNOME releases faster, etc.). Second, if we do not have enough manpower, I think converting Tanglu into a rolling-release distribution might be the only viable way to keep the project running. A rolling-release scheme creates much less effort for us than making releases (especially time-based ones!). That way, users will have a constantly updated and secure Tanglu system, with machines doing most of the background work.

If it turns out that absolutely nothing works and we can’t attract new people to help with Tanglu, it would mean that there generally isn’t much interest from the developer or user side in a project like this, so shutting it down or scaling it down dramatically would be the only option. But I do not think that this is the case, and I believe that having Tanglu around is important. I also have some interesting plans for it which will be fun to implement for testing 🙂

The only thing that had to stop was leaving our users in the dark about what is happening.

Sorry for the long post, but there are some subjects which are worth writing more than 140 characters about 🙂

If you are interested in contributing to Tanglu, get in touch with us! We have an IRC channel #tanglu-devel on Freenode (go there for quicker responses!), as well as forums and mailing lists.

It looks like I will be at DebConf this year as well, so you can also catch me there! I might even talk about PureOS/Tanglu infrastructure at the conference.

April 03, 2017

So there is a thread on Hacker News based on a question from a Canonical employee asking for feedback on what people want from the next version of Ubuntu. I always try to read such threads, even when they are not about Fedora or Red Hat. In fact I often read such articles and threads about non-Linux systems too, to help understand what people are looking for and thus enable us to prioritize what we do with Fedora Workstation even better.

Fedora Workstation

Over the last few years I do feel we managed to nail down what the major pain points are and have crossed them off one by one, or have gotten people assigned to work on them. So a lot of the items people asked for in that thread we already have in Fedora Workstation or already have on our roadmap. I thought it would be nice to write them up and maybe encourage people to take a look at Fedora Workstation if you haven’t done so already. The list below is my attempt to go through the long thread and pick out the important and recurring topics, so I hope I got most of them, but if I missed something feel free to add a comment and I will try to answer.

1. Handling of DPI scaling and HiDPI
This has been something we have been working on for quite a while. I think we were the first distribution to implement general HiDPI support, and we put a lot of engineering time into updating Wayland and GNOME to make it happen. That said, things are not perfect yet, but we are working on resolving the remaining issues. Jonas Ådahl and Rui Matos are currently trying to resolve the two main issues we still see. The first item is non-integer UI scaling. Currently we only offer integer scaling, meaning that in practice we only offer 2x scaling. This is too much for many displays, however, so we are working on a solution to offer fractional scaling, like 1.5x for instance. We are certain to have that ready for Fedora Workstation 27, but there is a small hope we can finalize it already for Fedora Workstation 26. The other item is dealing with applications that rely on XWayland, because unlike native Wayland applications they cannot handle different DPI scaling across two or more monitors. We are dealing with that in two ways: one is working with upstreams to get their applications Wayland native, like the work we have been doing with LibreOffice and Firefox. We are also trying to come up with a scaling solution for applications that use XWayland, but we haven’t been able to come up with a solution there yet.

2. Multitouch gestures like 3-finger swipe to change workspace.
This is another item we have put significant effort into. Over the last few years we made sure that we went from almost no touch support in the desktop to now supporting touch throughout the stack. The one big remaining item that was holding back things like proper gestures is that, until kernel 4.12 is out, Synaptics touchpads are driven over PS/2. This causes only 2 touches to be reported, which is not great when you want 3-finger gestures. With the new kernel we will be using a different bus for those Synaptics touchpads, and we will have proper 5-finger support. Benjamin Tissoires on our team has spent 4 years getting this code upstream, but it’s finally here. We plan on backporting this code to Fedora Workstation 26. Of course application developers will need to make use of the infrastructure in their applications for this feature to be fully realized everywhere.

Multitouch

3. Battery life
This is something we realize is a major issue, and it has been on our agenda for a long time. It is a really hard issue to resolve because it is tied into a lot of things outside of our control, like the hardware used and, in some cases, third-party drivers. That said, as many of you might know, we recently set up a Laptop team here inside the bigger Red Hat desktop team, and battery life is one of their top priorities. Christian Kellner is our point man on battery life, and he has taken over the GNOME Battery bench tool that was originally created by Owen Taylor when we started looking at battery life. He is currently working on improving GNOME Battery bench and talking to hardware vendors to figure out what we can do. We are also actively speaking with NVidia to ensure that we can provide good battery life for hybrid graphics users when the binary NVidia driver is installed. We hope to agree with them on interfaces that should allow us to provide top-notch battery life for such systems, but we are beholden to changes in the binary drivers to make that happen, so it is also an example of the limits of what we can do on our own here.

GNOME Battery bench

4. UEFI issues
There were people on the Hacker News thread talking about issues with UEFI. Once again this is an area where we have a dedicated engineer assigned, making sure UEFI works great. In fact Peter Jones, who is our UEFI point man, is on the UEFI standards committee, doing ongoing work to ensure the standard is open source friendly and well supported by Linux. It is also worth mentioning that we created the Linux Vendor Firmware Service to make updating UEFI firmware an easy process. So if you see firmware updates offered in GNOME Software for your laptop or other devices, that is because of the work we put into this service. We expect to have most of the major vendors signed up by the end of the year, so if your system is currently not supported that is hopefully a temporary thing. So this is something that works well under Fedora and RHEL because we have someone dedicated to the effort, and it is another example of us doing the heavy lifting to make things actually happen.

UEFI firmware updates

5. We got Wayland
A lot of people in the thread asked about Wayland support, and well, we got Wayland! This is another area where we have dedicated serious engineering resources and are continuing to do so. For instance, in addition to the above-mentioned multi-DPI work, we are working on items such as HDR (high dynamic range) support and next-generation hybrid graphics support in Wayland. We are also working with NVidia to ensure their binary driver works well with Wayland.

Wayland Graphics

6. Something like Redshift
Some time ago we picked up on the growing popularity of tools such as Redshift and f.lux, and this was another often-repeated request in that Hacker News thread. Well, we once again invested our resources into this, and thus the newly released GNOME 3.24 has built-in support for this feature, called Night Light. We drove this feature work, and it will of course be available alongside GNOME 3.24 in Fedora Workstation 26.

GNOME Night Light
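
For those who like to script their desktop settings, Night Light is also exposed through GSettings. Below is a minimal sketch using PyGObject; note that the schema and key names (org.gnome.settings-daemon.plugins.color, night-light-enabled, night-light-temperature) are my assumption of how GNOME 3.24 exposes the feature, so check them on your system with gsettings list-keys before relying on them.

    # Minimal sketch: toggling Night Light via GSettings from Python.
    # Assumes PyGObject is installed and that GNOME 3.24 exposes the
    # night-light-enabled and night-light-temperature keys under the
    # org.gnome.settings-daemon.plugins.color schema (an assumption; verify
    # with `gsettings list-keys org.gnome.settings-daemon.plugins.color`).
    import gi
    gi.require_version("Gio", "2.0")
    from gi.repository import Gio

    SCHEMA = "org.gnome.settings-daemon.plugins.color"

    settings = Gio.Settings.new(SCHEMA)
    settings.set_boolean("night-light-enabled", True)   # turn the feature on
    settings.set_uint("night-light-temperature", 4000)  # warmer colour temperature, in kelvin
    Gio.Settings.sync()                                  # flush pending writes before the script exits

The same keys should be adjustable from the command line with the gsettings tool, and of course the supported way is simply the toggle in the GNOME display settings.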

7. Improved GPU driver update
This is another item we have been spending significant time and resources on. We have a team dedicated to working on the Linux graphics stack, which includes people like the graphics subsystem kernel maintainer and RADV creator Dave Airlie, Nouveau maintainer Ben Skeggs, core X, Mesa and Wayland developer Adam Jackson, Freedreno creator and maintainer Rob Clark, and more. This team is pushing the Linux graphics stack forward alongside their colleagues at Intel, AMD and NVidia. One thing we have recently been working on, for instance, is dealing with the NVidia binary driver, which has been a pain for a long time due to its file-level conflict with Mesa. We didn’t want to do a workaround or hack, so what we did was work with NVidia on their glvnd proposal to make that a reality. This included supporting glvnd in Mesa in addition to the NVidia driver, but also working with the OS-level tools to ensure fallbacks and autodetection worked fine. We already have the basics of glvnd support in Fedora and are polishing it up. Hans de Goede, who took over that work from Adam Jackson, has recently been working with the fine folks at rpmfusion and negativo17 to make sure we have some good packages available that take full advantage of his work, thus enabling easy install and upgrade of these drivers. We are also planning to start offering a COPR with the latest and greatest Mesa drivers going forward, to ensure you can always have the latest drivers available if you want to test and try them out.

NVidia driver install

8. Improved printer support
Even in this digital world printing is still important, and thus we have people dedicated to this task too. Marek Kasik is working on ensuring we keep CUPS working well, and Felipe Borges recently wrote a blog entry talking about the redesigned printer control panel. So this is another area we are spending serious resources on and continuously trying to improve.

New Printer panel

9. Improved Bluetooth
Bastien Nocera on our team is probably the person who has done the most to make sure desktop Bluetooth works at all. We decided to boost that effort by having Christian Kellner work on this too, so he has been working on patches for various Bluetooth-related issues, with Bastien providing guidance and code review. We are also working on coming up with some kind of Bluetooth testing harness to allow us to catch regressions more easily and verify support on new hardware. Christian’s current focus is improving the handling of Bluetooth audio.

Summary
If you are contemplating giving Fedora a try, I think the items above illustrate one thing very strongly: how many of these issues we are the primary force behind. By using Fedora you are not only getting access to them first, with some assurance that the integration work has been done right, but you are also supporting the effort of moving these technologies forward and putting yourself in a position to more directly interact with the engineers working on these and a long list of other important technologies in the desktop and beyond. And our efforts are not limited to writing code; take for example our current effort to clear the legal hurdles blocking Linux systems from supporting various media codecs. So if you haven’t already, I strongly suggest you go to the Get Fedora website and grab our convenient Fedora installer or an ISO image. And as I said initially, if you have other pressing items I didn’t cover here, feel free to post a comment and I will be happy to try to answer any questions I get.