planet.freedesktop.org
October 19, 2017

So I have blogged regularly over the last few years about upcoming features in Fedora Workstation. Well, as we are putting the finishing touches on Fedora Workstation 27, I thought I should try to look back at everything we have achieved since Fedora Workstation was launched with Fedora 21. The efforts I highlight here are ones where we did most or a significant share of the development. There are of course a lot of other big changes that have happened over the last few years in the wider community that we leveraged and offer in Fedora Workstation; examples include things like Meson and Rust. This post is not about those, but that said I do want to write a post just talking about the achievements of the wider community at some point, because they are very important and crucial too. Along the same lines, this post will not speak about the large number of improvements and bugfixes that we contributed to a long list of projects, like to GNOME itself. This blog is about taking stock and taking some pride in what we have achieved so far and the major hurdles we passed on our way to improving the Linux desktop experience.
This blog is also slightly different from my normal format, as I will not call out individual developers by name as I usually do; instead I will treat this as a team effort and thus just say ‘we’.

  • Wayland – We have been the biggest contributor since we joined the effort and have taken the lead on putting in place all the pieces needed for actually using it on a desktop, including starting to ship it as our primary offering in Fedora Workstation 25. This includes putting a lot of effort into making XWayland work smoothly, to provide full legacy application support.
  • Libinput – A new library we created for handling all input under both X and Wayland. It came about because Wayland needed input handling that was not tied to X, but it has improved input handling for X itself too. Libinput is being rapidly developed and improved, with 1.9 coming out just a few days ago.
  • glvnd – Dealing with multiple OpenGL implementations has been a pain under Linux for years. We worked with NVidia on this effort to ensure that you can install multiple OpenGL implementations on the system and have it use the correct one depending on which GPU and driver you are using. We keep expanding this solution to cover more use cases; for Fedora Workstation 27 we expect to bring glvnd support to XWayland, for instance.
  • Porting Firefox to GTK3 – We ported Firefox to GTK3, including making sure it works under Wayland. This work also provided the foundation for HiDPI support in Firefox. We are the single biggest contributor to Firefox Linux support.
  • Porting LibreOffice to GTK3 – We ported LibreOffice to GTK3, which included Wayland support, touch support and HiDPI support. Our team is one of the major contributors to LibreOffice and helps move the project forward on a lot of fronts.
  • Google Drive integration – We extended the general Google integration in GNOME 3 to include support for Google Drive, as we found that a lot of our users were relying on Google Apps at their work.
  • Flatpak – We created Flatpak to lead the way in moving desktop applications into their own namespaces and containers, resolving a lot of long term challenges for desktop applications on Linux. We expect to have new infrastructure in place in Fedora soon to allow Fedora packagers to quickly and easily turn their applications into Flatpaks.
  • Linux Firmware Service – We created the Linux Firmware Service to give Linux users easy access to UEFI firmware updates for their systems, and worked with great vendors such as Dell and Logitech to get them to support it for their devices. Many bugs experienced by Linux users over the years could have been resolved by firmware updates, but with tooling being spotty many Linux users were not even aware that fixes were available.
  • GNOME Software – We created GNOME Software to give us a proper software store on Fedora and extended it over time with features such as fonts, GStreamer plugins, GNOME Shell extensions and UEFI firmware updates. Today it is not just our main software store application; our work has been adopted by other major distributions too.
  • mp3, ac3 and aac support – We have spent a lot of time bringing support for major audio codecs such as MP3, AC3 and AAC to Fedora. In the age of streaming, codec support is maybe less important than it used to be, but there is still a lot of media on people's computers that they need and want access to.
  • Fedora Media Creator – A cross-platform media creator that makes it very easy to create Fedora Workstation install media regardless of whether you are on Windows, Mac or Linux. As we move away from optical media, offering ISO downloads started feeling more and more outdated; with the media creator we provide a uniform user experience for quickly creating USB install media, which is especially important for new users coming from Windows and Mac environments.
  • Captive portal – We added support for captive portals in Network Manager and GNOME 3, ensuring easy access to the internet over public wifi networks. This feature has been with us for a few years now, but it is still a much appreciated addition.
  • HiDPI support – We worked to add support for HiDPI across X, Wayland, GTK3 and GNOME3. We lead the way on HiDPI support under Linux and keep working on various applications to this day to polish up the support.
  • Touch support – We worked to add support for touchscreens across X, Wayland, GTK3 and GNOME3. We spent significant resources enabling this, both for laptop touchscreens and for modern Wacom devices.
  • QGNOME Platform – We created the QGNOME Platform to ensure that Qt applications work well under GNOME3 and give a nice, native, integrated feel. While we ship GNOME as our desktop offering, we want Qt applications to work well and feel native. This is an ongoing effort, but for many important applications it is already a great improvement.
  • Nautilus improvements – Nautilus had been undermaintained for quite a while, so we had Carlos Soriano spend significant time reworking major parts of it, adding new features like renaming multiple files at once, updating the views and in general bringing it up to date.
  • Night light support in GNOME – We added support for automatically adjusting the color and light settings on your system based on the light sensors found in modern laptops. This integrates functionality that you previously had to install extra software such as Redshift to get.
  • libratbag – We created a library that enables easy configuration of high-end mice and other kinds of input devices. This has led to increased collaboration with a lot of gaming mice manufacturers, to ensure full support for their devices under Linux.
  • RADV – We created a full open source Vulkan implementation for AMD GPUs, which recently got certified as Vulkan compliant. We wanted to give open source Vulkan a boost, so we created the RADV project, which now has an active community around it and is being tested with major games.
  • GNOME Shell performance improvements – We have been working on various performance improvements to GNOME Shell over the last few years, and significant improvements have happened. We want to push the envelope on this further though, and are planning a major hackfest around Shell performance and resource usage early next year.
  • GNOME terminal developer improvements – We worked to improve the features of GNOME Terminal to make it an even better tool for developers with items such as easier naming of terminals and notifications for long running jobs.
  • GNOME Builder – Improving the developer story is crucial for us, and we have been doing a lot of work to make GNOME Builder a great tool for developers, both for improving the desktop itself and for development in general.
  • Pipewire – We created a new media server to unify audio, pro-audio and video. The first version, which we are shipping in Fedora 27, handles our video capture.
  • Fleet Commander – We launched Fleet Commander, our new tool for managing large Linux desktop deployments. It answers a long-standing call from many of Red Hat's major desktop customers, and from admins of large-scale Linux deployments at universities and elsewhere, for a powerful yet easy-to-use administration tool for large desktop deployments.

I am sure I missed something, but this is at least a decent list of Fedora Workstation highlights from the last few years. Next up: my Fedora Workstation 27 blog post :)

October 18, 2017

Alberto Ruiz just announced Fleet Commander as production ready! Fleet Commander is our new tool for managing large deployments of Fedora Workstation and RHEL desktop systems. So head over to Alberto's Fleet Commander blog post for all the details.

Something went incredibly right: review feedback poured in last week, and I got to merge a lot of code.

My VC5 GL driver’s patches for core Mesa got reviewed (thanks Rob Clark, Adam Jackson, and Emil Velikov), so I got to merge them to Mesa. It’s so nice to finally be able to work in tree instead of on a rebasing branch that breaks most weeks.

My GL_OES_required_internalformat support got reviewed by Nicolai Hähnle, so I gave it another test run on the Intel CI farm (thanks, Mark Janes!) and merged it. VC4 and VC5 now have proper 5551 texture format support, and VC4 conformance test failures with 565 are fixed.

My GL_MESA_tile_raster_order extension for overlapping blit support on VC4 got merged to Khronos’s git tree. Nicolai reviewed my Mesa implementation of the extension, so I’ve merged it. All that’s left for that is merging the X Server usage of it and pushing it on downstream to Raspbian.

I tested the fast mutex patch series for Mesa, and found a 4.3% (+/- .9%) improvement in 10x10 copywinwin on my Intel hardware. Hopefully this lands soon, since those performance improvements should show up on ARM as well.

On the VC5 front, I fixed VPM setup on actual HW (the simulator’s restrictions didn’t catch one of the HW requirements), getting a lot of tests that do gl_ModelViewProjectionMatrix * gl_Vertex to work. I played around with the new GPU reset code a bit, and it looks like the next step is to implement binner overflow handling.

I’ve been doing some more review feedback with Boris. We’re getting closer to merge on MADVISE, for sure. I respun my DSI transactions fix based on Boris’s feedback, and it’s even nicer now.

Next week: VC5 binner overflow handling, merging MADVISE, and hopefully putting together some Raspbian backports.

October 12, 2017

So I am really happy to announce another major codec addition to Fedora Workstation 27, namely the codec called AAC. As you might have seen from Tom Callaway's announcement, this has just been cleared for inclusion in Fedora.

For those not well versed in the arcane lore of audio codecs, AAC is the codec used by things like iTunes and is found in a lot of general media files online. AAC stands for Advanced Audio Coding and was created by the MPEG working group as the successor to MP3. Especially due to Apple embracing the format, there are a lot of files out there using it, and thus we wanted to support it in Fedora too.

What we will be shipping in Fedora is a modified version of the AAC implementation released by Google, which was originally written by Fraunhofer. On top of that we will of course be providing GStreamer plugins to enable full support for playing and creating AAC files in GStreamer applications.

Be aware though that AAC is a bit of an umbrella term for a lot of different technologies, and thus you might come across files that claim to use AAC but which we can not play back. The most likely reason for that would be that they require an AAC profile we do not support. The version of AAC that we will be shipping has also been carefully created to fit within the requirements for software in Fedora, so if you are a packager be aware that, unlike with for instance MP3, this change does not mean you can package and ship any AAC implementation you want in Fedora.

I am expecting to have more major codec announcements soon, so stay tuned :)

October 10, 2017

I spent this week in front of the VC5 hardware, working toward implementing GPU reset. I’m going to need reliable reset before I can start running GL testsuites.

Of course, to believe that GPU reset works, I’m going to want some tests that trigger it. I pulled the VC5 XML-based code-generation from Mesa into i-g-t, and built up basic rendering tests using it. It caught that I was misusing struct scatterlist (causing CPU page faults trying to fill the VC5 MMU). I also had mistyped a bit of the XML, calling a bitmask a bool so that the hardware tried to store render targets 1, 2, and 3, instead of 0 (causing GPU hangs). After stabilizing all that, building the hang testcase was pretty simple.

Taking a break from kernel adventures, I did a bit more work on the vc5 GL driver. Transform feedback now has many tests passing, provoking vertex is supported, float32 render targets are supported, MRTs are supported, colormasks are fixed for non-independent blending, a regression in blending with a 0 factor is fixed, and >32bpp clear colors are fixed.

I’ve got a new revision of Boris’s VC4 MADVISE work, and it’s looking good. Boris has also cleaned up some debug messages that have been spamming dmesg on the Raspberry Pi, which is great news for us.

I also spent quite some time reviewing Dylan’s meson work for Mesa. It’s a slog – build systems for a project of Mesa’s scale are huge, but I’ve seen what meson can get you in productivity gains from my conversion of the X Server, and it looks like we should be able to replace all 3(!) of Mesa’s build systems with this one.

Finally, on Friday I got the reviews necessary for the DSI panel driver, and I merged it to drm-misc-next to appear in the 4.15 kernel. We still need to figure out what to do about the devicetree for it (since it’s sort of an optional piece of hardware for the board, but more official than other hardware), but at least this is a lot less for downstreams to carry.

Next week: merging vc5 GL and working on actually performing GPU reset.

October 08, 2017

TL;DR: systemd now can do per-service IP traffic accounting, as well as access control for IP address ranges.

Last Friday we released systemd 235. I already blogged about its Dynamic User feature in detail, but there's one more piece of new functionality that I think deserves special attention: IP accounting and access control.

Before v235 systemd already provided per-unit resource management hooks for a number of different kinds of resources: consumed CPU time, disk I/O, memory usage and number of tasks. With v235 another kind of resource can be controlled per-unit with systemd: network traffic (specifically IP).

Three new unit file settings have been added in this context:

  1. IPAccounting= is a boolean setting. If enabled for a unit, all IP traffic sent and received by processes associated with it is counted both in terms of bytes and of packets.

  2. IPAddressDeny= takes an IP address prefix (that means: an IP address with a network mask). All traffic from and to this address will be prohibited for processes of the service.

  3. IPAddressAllow= is the matching positive counterpart to IPAddressDeny=. All traffic matching this IP address/network mask combination will be allowed, even if otherwise listed in IPAddressDeny=.

The three options are thin wrappers around kernel functionality introduced with Linux 4.11: the control group eBPF hooks. The actual work is done by the kernel, systemd just provides a number of new settings to configure this facet of it. Note that cgroup/eBPF is unrelated to classic Linux firewalling, i.e. NetFilter/iptables. It's up to you whether you use one or the other, or both in combination (or of course neither).
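To make this concrete, here is a minimal sketch of a unit file fragment combining all three settings (the daemon path and the 10.0.0.0/8 range are placeholders chosen purely for illustration):

[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# Count all IP bytes/packets sent and received by the service
IPAccounting=yes
# Prohibit all IP traffic by default...
IPAddressDeny=any
# ...except for the loop-back device and an example internal network
IPAddressAllow=localhost
IPAddressAllow=10.0.0.0/8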

IP Accounting

Let's have a closer look at the IP accounting logic mentioned above. Let's write a simple unit /etc/systemd/system/ip-accounting-test.service:

[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes

This simple unit invokes the ping(8) command to send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which is the Google DNS server IP; we use it for testing here, since it's easy to remember, reachable everywhere and known to react to ICMP pings; any other IP address responding to pings would be fine to use, too). The IPAccounting= option is used to turn on IP accounting for the unit.

Let's start this service after writing the file. Let's then have a look at the status output of systemctl:

# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
 Main PID: 32152 (ping)
       IP: 168B in, 168B out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:05:47 sigma systemd[1]: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping[32152]: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms

This shows the ping command running — it's currently at its second ping cycle as we can see in the logs at the end of the output. More interesting however is the IP: line further up showing the current IP byte counters. It currently shows 168 bytes have been received, and 168 bytes have been sent. That the two counters are at the same value is not surprising: ICMP ping requests and responses are supposed to have the same size. Note that this line is shown only if IPAccounting= is turned on for the service, as only then this data is collected.

Let's wait a bit, and invoke systemctl status again:

# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
 Main PID: 32152 (ping)
       IP: 22.2K in, 22.2K out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:10:07 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms

As we can see, after 269 pings the counters are much higher: at 22K.

Note that while systemctl status shows only the byte counters, packet counters are kept as well. Use the low-level systemctl show command to query the current raw values of the in and out packet and byte counters:

# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449

Of course, the same information is also available via the D-Bus APIs. If you want to process this data further consider talking proper D-Bus, rather than scraping the output of systemctl show.
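For instance, a rough sketch of reading one of the counters with busctl (the D-Bus object path encodes the unit name with non-alphanumeric characters escaped; the properties live on the org.freedesktop.systemd1.Service interface):

# busctl get-property org.freedesktop.systemd1 \
    /org/freedesktop/systemd1/unit/ip_2daccounting_2dtest_2eservice \
    org.freedesktop.systemd1.Service IPIngressBytes
t 37776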

Now, let's stop the service again:

# systemctl stop ip-accounting-test

When a service with such accounting turned on terminates, a log line about all its consumed resources is written to the logs. Let's check with journalctl:

# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd[1]: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd[1]: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.5K IP traffic, sent 49.5K IP traffic

The last line shown is the interesting one, that shows the accounting data. It's actually a structured log message, and among its metadata fields it contains the more comprehensive raw data:

# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
    PRIORITY=6
    _BOOT_ID=4c7e7adcba0c45b69d612857270716d3
    _MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
    _HOSTNAME=sigma
    SYSLOG_FACILITY=3
    SYSLOG_IDENTIFIER=systemd
    _UID=0
    _GID=0
    _TRANSPORT=journal
    _PID=1
    _COMM=systemd
    _EXE=/usr/lib/systemd/systemd
    _CAP_EFFECTIVE=3fffffffff
    _SYSTEMD_CGROUP=/init.scope
    _SYSTEMD_UNIT=init.scope
    _SYSTEMD_SLICE=-.slice
    CODE_FILE=../src/core/unit.c
    _CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    UNIT=ip-accounting-test.service
    CODE_LINE=2115
    CODE_FUNC=unit_log_resources
    MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
    INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
    IP_METRIC_INGRESS_BYTES=50880
    IP_METRIC_INGRESS_PACKETS=605
    IP_METRIC_EGRESS_BYTES=50880
    IP_METRIC_EGRESS_PACKETS=605
    MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
    _SOURCE_REALTIME_TIMESTAMP=1507565752649028

The interesting fields of this log message are of course IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=, IP_METRIC_EGRESS_BYTES=, IP_METRIC_EGRESS_PACKETS= that show the consumed data.

The log message carries a message ID that may be used to quickly search for all such resource log messages (ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for messages of this ID with journalctl's -u switch to quickly find out about the resource usage of any invocation of a specific service. Let's try:

# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic

Of course, the output above shows only one message at the moment, since we started the service only once, but a new one will appear every time you start and stop it again.

The IP accounting logic is also hooked up with systemd-run, which is useful for transiently running a command as systemd service with IP accounting turned on. Let's try it:

# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K

This uses wget to download the PDF version of the 2nd day schedule of everybody's favorite Linux user-space conference All Systems Go! 2017 (BTW, have you already booked your ticket? We are very close to selling out, be quick!). The IP traffic this command generated was 231K ingress and 4K egress. In the systemd-run command line two parameters are important. First of all, we use -p IPAccounting=yes to turn on IP accounting for the transient service (as above). And secondly we use --wait to tell systemd-run to wait for the service to exit. If --wait is used, systemd-run will also show you various statistics about the service that just ran and terminated, including the IP statistics you are seeing if IP accounting has been turned on.

It's fun to combine this sort of IP accounting with interactive transient units. Let's try that:

# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B

This uses systemd-run's --pty switch (or short: -t), which opens an interactive pseudo-TTY connection to the invoked service process, which is a bourne shell in this case. Doing this means we have a full, comprehensive shell with job control and everything. Since the shell is running as part of a service with IP accounting turned on, all IP traffic we generate or receive will be accounted for. And as soon as we exit the shell, we'll see what it consumed. (For the sake of brevity I actually didn't paste the whole output above, but truncated core parts. Try it out for yourself, if you want to see the output in full.)

Sometimes it might make sense to turn on IP accounting for a unit that is already running. For that, use systemctl set-property foobar.service IPAccounting=yes, which will instantly turn on accounting for it. Note that it won't count retroactively though: only the traffic sent/received after the point in time you turned it on will be collected. You may turn off accounting for the unit with the same command.

Of course, sometimes it's interesting to collect IP accounting data for all services, and turning on IPAccounting=yes in every single unit is cumbersome. To deal with that there's a global option DefaultIPAccounting= available which can be set in /etc/systemd/system.conf.
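That could look like this, in the file's [Manager] section:

# /etc/systemd/system.conf
[Manager]
DefaultIPAccounting=yes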

IP Access Lists

So much about IP accounting. Let's now have a look at IP access control with systemd 235. As mentioned above, the two new unit file settings IPAddressAllow= and IPAddressDeny= may be used for that. They operate in the following way:

  1. If the source address of an incoming packet or the destination address of an outgoing packet matches one of the IP addresses/network masks in the relevant unit's IPAddressAllow= setting then it will be allowed to go through.

  2. Otherwise, if a packet matches an IPAddressDeny= entry configured for the service it is dropped.

  3. If the packet matches neither of the above it is allowed to go through.

Or in other words, IPAddressDeny= implements a blacklist, but IPAddressAllow= takes precedence.

Let's try that out. Let's modify our last example above in order to get a transient service running an interactive shell which has such an access list set:

# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms

--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit

The access list we set up uses IPAddressDeny=any in order to define an IP white-list: all traffic will be prohibited for the session, except for what is explicitly white-listed. In this command line, we white-listed two address prefixes: 8.8.8.8 (with no explicit network mask, which means the mask with all bits turned on is implied, i.e. /32), and 127.0.0.0/8. Thus, the service can communicate with Google's DNS server and everything on the local loop-back, but nothing else. The commands run in this interactive shell show this: first we try pinging 8.8.8.8, which happily responds. Then, we try to ping 8.8.4.4 (that's Google's other DNS server, but excluded from this white-list), and as we see it is immediately refused with an Operation not permitted error. As a last step we ping 127.0.0.2 (which is on the local loop-back), and we see it works fine again, as expected.

In the example above we used IPAddressDeny=any. The any identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it's a shortcut for everything, on both IPv4 and IPv6. A number of other such shortcuts exist. For example, instead of spelling out 127.0.0.0/8 we could also have used the more descriptive shortcut localhost which is expanded to 127.0.0.0/8 ::1/128, i.e. everything on the local loopback device, on both IPv4 and IPv6.

Being able to configure IP access lists individually for each unit is pretty nice already. However, typically one wants to configure this comprehensively, not just for individual units, but for a set of units in one go or even the system as a whole. In systemd, that's possible by making use of .slice units (for those who don't know systemd that well, slice units are a concept for organizing services in a hierarchical tree for the purpose of resource management): the IP access list in effect for a unit is the combination of the individual IP access lists configured for the unit itself and those of all slice units it is contained in.

By default, system services are assigned to system.slice, which in turn is a child of the root slice -.slice. Either of these two slice units is hence suitable for locking down all system services at once. If an access list is configured on system.slice it will only apply to system services; however, if configured on -.slice it will apply to all user processes of the system, including all user session processes (which are by default assigned to user.slice, a child of -.slice) in addition to the system services.

Let's make use of this:

# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8

The two commands above are a very powerful way to first turn off all IP communication for all system services (with the exception of loop-back traffic), followed by an explicit white-listing of 10.0.0.0/8 (which could refer to the local company network, you get the idea) but only for the Apache service.

Use-cases

After playing around a bit with this, let's talk about use-cases. Here are a few ideas:

  1. The IP access list logic can in many ways provide a more modern replacement for the venerable TCP Wrapper, but unlike it, it applies to all IP sockets of a service unconditionally, and requires no explicit support in the service's code: no patching required. On the other hand, TCP wrappers have a number of features this scheme cannot cover; most importantly, systemd's IP access lists operate solely on the level of IP addresses and network masks, and there is no way to configure access by DNS name (though quite frankly, that is a very dubious feature anyway, as doing networking — unsecured networking even — in order to restrict networking sounds quite questionable, at least to me).

  2. It can also replace (or augment) some facets of IP firewalling, i.e. Linux NetFilter/iptables. Right now, systemd's access lists are of course a lot more minimal than NetFilter, but they have one major benefit: they understand the service concept, and thus are a lot more context-aware than NetFilter. Classic firewalls, such as NetFilter, derive most service context from the IP port number alone, but we live in a world where IP port numbers are a lot more dynamic than they used to be. As one example, a BitTorrent client or server may use any IP port it likes for its file transfers, and writing IP firewalling rules matching that precisely is hence hard. With the systemd IP access lists, implementing this is easy: just set the list for your BitTorrent service unit, and all is good.

    Let me stress though that you should be careful when comparing NetFilter with systemd's IP address list logic, it's really like comparing apples and oranges: to start with, the IP address list logic has a clearly local focus, it only knows what a local service is and manages access of it. NetFilter on the other hand may run on border gateways, at a point where the traffic flowing through is pure IP, carrying no information about a systemd unit concept or anything like that.

  3. It's a simple way to lock down distribution/vendor supplied system services by default. For example, if you ship a service that you know never needs to access the network, then simply set IPAddressDeny=any (possibly combined with IPAddressAllow=localhost) for it, and it will live in a very tight networking sand-box it cannot escape from. systemd itself makes use of this for a number of its services by default now. For example, the logging service systemd-journald.service, the login manager systemd-logind or the core-dump processing unit systemd-coredump@.service all have such a rule set out-of-the-box, because we know that none of these services should be able to access the network, under any circumstances.

  4. Because the IP access list logic can be combined with transient units, it can be used to quickly and effectively sandbox arbitrary commands, and even include them in shell pipelines and such. For example, let's say we don't trust our curl implementation (maybe it got modified locally by a hacker, and phones home?), but want to use it anyway to download the slides of my most recent casync talk in order to print them, but want to make sure it doesn't connect anywhere except where we tell it to (and to make this even more fun, let's minimize privileges further, by setting DynamicUser=yes):

    # systemd-resolve 0pointer.de
    0pointer.de: 85.214.157.71
                 2a01:238:43ed:c300:10c3:bcf3:3266:da74
    -- Information acquired via protocol DNS in 2.8ms.
    -- Data is authenticated: no
    # systemd-run --pipe -p IPAddressDeny=any \
                         -p IPAddressAllow=85.214.157.71 \
                         -p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
                         -p DynamicUser=yes \
                         curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
    

So much about use-cases. This is by no means a comprehensive list of what you can do with it, after all both IP accounting and IP access lists are very generic concepts. But I do hope the above inspires your fantasy.

What does that mean for packagers?

IP accounting and IP access control are primarily concepts for the local administrator. However, as suggested above, it's a very good idea to ship services that by design have no network-facing functionality with an access list of IPAddressDeny=any (and possibly IPAddressAllow=localhost), in order to improve the out-of-the-box security of our systems.
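As a sketch, a packager shipping such a service might add something like the following to the unit file:

[Service]
# This service never needs network access: prohibit all IP
# traffic, with the exception of the local loop-back
IPAddressDeny=any
IPAddressAllow=localhost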

An option for security-minded distributions might be a more radical approach: ship the system with -.slice or system.slice configured to IPAddressDeny=any by default, and ask the administrator to punch holes into that for each network facing service with systemctl set-property … IPAddressAllow=…. But of course, that's only an option for distributions willing to break compatibility with what was before.

Notes

A couple of additional notes:

  1. IP accounting and access lists may be mixed with socket activation. In this case, it's a good idea to configure access lists and accounting for both the socket unit that activates and the service unit that is activated, as both units maintain fully separate settings. Note that IP accounting and access lists configured on the socket unit apply to all sockets created on behalf of that unit, and even if these sockets are passed on to the activated services, they will still remain in effect and belong to the socket unit. This also means that IP traffic done on such sockets will be accounted to the socket unit, not the service unit. The fact that IP access lists are maintained separately for the kernel sockets created on behalf of the socket unit and for the kernel sockets created by the service code itself enables some interesting uses. For example, it's possible to set a relatively open access list on the socket unit, but a very restrictive access list on the service unit, thus making the sockets configured through the socket unit the only way in and out of the service.

  2. systemd's IP accounting and access lists apply to IP sockets only, not to sockets of any other address families. That also means that AF_PACKET (i.e. raw) sockets are not covered. It's hence a good idea to combine IP access lists with RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 in order to lock this down (see the sketch after these notes).

  3. You may wonder if the per-unit resource log message and systemd-run --wait may also show you details about other types of resources consumed by a service. The answer is yes: if you turn on CPUAccounting= for a service, you'll also see a summary of consumed CPU time in the log message and the command output. And we are planning to hook up IOAccounting= the same way too, soon.

  4. Note that IP accounting and access lists aren't entirely free. systemd inserts an eBPF program into the IP pipeline to make this functionality work. However, eBPF execution has already been optimized for speed in recent kernel versions, and given that it is currently a focus of interest for many, I'd expect it to be optimized even further, so that the cost of enabling these features will be negligible, if it isn't already.

  5. IP accounting is currently not recursive. That means you cannot use a slice unit to join the accounting of multiple units into one. This is something we definitely want to add, but requires some more kernel work first.

  6. You might wonder how the PrivateNetwork= setting relates to IPAddressDeny=any. Superficially they have similar effects: they make the network unavailable to services. However, looking more closely there are a number of differences. PrivateNetwork= is implemented using Linux network name-spaces. As such it entirely detaches all networking of a service from the host, including non-IP networking. It does so by creating a private little environment the service lives in, where communication with itself is still allowed though. In addition, using the JoinsNamespaceOf= dependency, additional services may be added to the same environment, thus permitting communication with each other but not with anything outside of this group. IPAddressAllow= and IPAddressDeny= are much less invasive. First of all they apply to IP networking only, and can match against specific IP addresses. A service running with PrivateNetwork= turned off but IPAddressDeny=any turned on may enumerate the network interfaces and the IP addresses configured on them, even though it cannot actually do any IP communication. On the other hand, if you turn on PrivateNetwork= all network interfaces besides lo disappear. Long story short: depending on your use-case one, the other, both or neither might be suitable for sand-boxing of your service. If possible I'd always turn on both, for best security; that's what we do for all of systemd's own long-running services.
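To illustrate note 2 above, here is a minimal sketch of a unit file fragment combining both mechanisms for a tighter network sand-box:

[Service]
# Allow only AF_UNIX (for local IPC) plus the IP families...
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# ...and then prohibit all IP traffic except the loop-back device,
# leaving the service with no usable network access at all
IPAddressDeny=any
IPAddressAllow=localhost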

And that's all for now. Have fun with per-unit IP accounting and access lists!

October 05, 2017

TL;DR: you may now configure systemd to dynamically allocate a UNIX user ID for service processes when it starts them and release it when it stops them. It's pretty secure, mixes well with transient services, socket activated services and service templating.

Today we released systemd 235. Among other improvements this greatly extends the dynamic user logic of systemd. Dynamic users are a powerful but little known concept, supported in its basic form since systemd 232. With this blog story I hope to make it a bit better known.

The UNIX user concept is the most basic and well-understood security concept in POSIX operating systems. It is UNIX/POSIX' primary security concept, the one everybody can agree on, and most security concepts that came after it (such as process capabilities, SELinux and other MACs, user name-spaces, …) in some form or another build on it, extend it or at least interface with it. If you build a Linux kernel with all security features turned off, the user concept is pretty much the one you'll still retain.

Originally, the user concept was introduced to make multi-user systems a reality, i.e. systems enabling multiple human users to share the same system at the same time, cleanly separating their resources and protecting them from each other. The majority of today's UNIX systems don't really use the user concept like that anymore though. Most of today's systems probably have only one actual human user (or even less!), but their user databases (/etc/passwd) list a good number more entries than that. Today, the majority of UNIX users in most environments are system users, i.e. users that are not the technical representation of a human sitting in front of a PC anymore, but the security identity a system service — an executable program — runs as. Even though traditional, simultaneous multi-user systems slowly became less relevant, their ground-breaking basic concept became the cornerstone of UNIX security. The OS is nowadays partitioned into isolated services — and each service runs as its own system user, and thus within its own, minimal security context.

The people behind the Android OS realized the relevance of the UNIX user concept as the primary security concept on UNIX, and took its use even further: on Android not only system services take benefit of the UNIX user concept, but each UI app gets its own, individual user identity too — thus neatly separating app resources from each other, and protecting app processes from each other, too.

Back in the more traditional Linux world things are a bit less advanced in this area. Even though users are the quintessential UNIX security concept, allocation and management of system users is still a pretty limited, raw and static affair. In most cases, RPM or DEB package installation scripts allocate a fixed number of (usually one) system users when you install the package of a service that wants to take benefit of the user concept, and from that point on the system user remains allocated on the system and is never deallocated again, even if the package is later removed again. Most Linux distributions limit the number of system users to 1000 (which isn't particularly a lot). Allocating a system user is hence expensive: the number of available users is limited, and there's no defined way to dispose of them after use. If you make use of system users too liberally, you are very likely to run out of them sooner rather than later.

You may wonder why system users are generally not deallocated when the package that registered them is uninstalled from a system (at least on most distributions). The reason for that is one relevant property of the user concept (you might even want to call this a design flaw): user IDs are sticky to files (and other objects such as IPC objects). If a service running as a specific system user creates a file at some location, and is then terminated and its package and user removed, then the created file still belongs to the numeric ID ("UID") the system user originally got assigned. When the next system user is allocated and — due to ID recycling — happens to get assigned the same numeric ID, then it will also gain access to the file, and that's generally considered a problem, given that the file belonged to a potentially very different service once upon a time, and likely should not be readable or changeable by anything coming after it. Distributions hence tend to avoid UID recycling which means system users remain registered forever on a system after they have been allocated once.

The above is a description of the status quo ante. Let's now focus on what systemd's dynamic user concept brings to the table, to improve the situation.

Introducing Dynamic Users

With systemd dynamic users we hope to make it easier and cheaper to allocate system users on-the-fly, thus substantially increasing the possible uses of this core UNIX security concept.

If you write a systemd service unit file, you may enable the dynamic user logic for it by setting the DynamicUser= option in its [Service] section to yes. If you do, a system user is dynamically allocated the instant the service binary is invoked, and released again when the service terminates. The user is automatically allocated from the UID range 61184–65519, by looking for a so far unused UID.

Now you may wonder, how does this concept deal with the sticky user issue discussed above? In order to counter the problem, two strategies easily come to mind:

  1. Prohibit the service from creating any files/directories or IPC objects

  2. Automatically remove the files/directories or IPC objects the service created when it shuts down.

In systemd we implemented both strategies, but for different parts of the execution environment. Specifically:

  1. Setting DynamicUser=yes implies ProtectSystem=strict and ProtectHome=read-only. These sand-boxing options turn off write access to pretty much the whole OS directory tree, with a few relevant exceptions, such as the API file systems /proc, /sys and so on, as well as /tmp and /var/tmp. (BTW: setting these two options on your regular services that do not use DynamicUser= is a good idea too, as it drastically reduces the exposure of the system to exploited services.)

  2. Setting DynamicUser=yes implies PrivateTmp=yes. This option sets up /tmp and /var/tmp for the service in a way that it gets its own, disconnected version of these directories, that are not shared by other services, and whose life-cycle is bound to the service's own life-cycle. Thus if the service goes down, the user is removed and all its temporary files and directories with it. (BTW: as above, consider setting this option for your regular services that do not use DynamicUser= too, it's a great way to lock things down security-wise.)

  3. Setting DynamicUser=yes implies RemoveIPC=yes. This option ensures that when the service goes down all SysV and POSIX IPC objects (shared memory, message queues, semaphores) owned by the service's user are removed. Thus, the life-cycle of the IPC objects is bound to the life-cycle of the dynamic user and service, too. (BTW: yes, here too, consider using this in your regular services, too!)

With these four settings in effect, services with dynamic users are nicely sand-boxed. They cannot create files or directories, except in /tmp and /var/tmp, where they will be removed automatically when the service shuts down, as will any IPC objects created. Sticky ownership of files/directories and IPC objects is hence dealt with effectively.

The RuntimeDirectory= option may be used to open up the sandbox a bit to external programs. If you set it to a directory name of your choice, that directory will be created below /run when the service is started, and removed in its entirety when it is terminated. The ownership of the directory is assigned to the service's dynamic user. This way, a dynamic user service can expose API interfaces (AF_UNIX sockets, …) to other services at a well-defined place, and again bind its life-cycle to the service's own run-time. Example: set RuntimeDirectory=foobar in your service, and watch how a directory /run/foobar appears at the moment you start the service, and disappears the moment you stop it. (BTW: much like the other settings discussed above, RuntimeDirectory= may be used outside of the DynamicUser= context too, and is a nice way to run any service with a properly owned, life-cycle-managed run-time directory.)
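As a minimal sketch of such a unit (the daemon binary is a placeholder):

[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/foobar-daemon
DynamicUser=yes
# Creates /run/foobar (owned by the dynamic user) when the service
# starts, and removes it in its entirety when it stops
RuntimeDirectory=foobar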

Persistent Data

Of course, a service running in such an environment (though one already very useful for many cases!) has a major limitation: it cannot leave persistent data around that it can reuse on a later run. As pretty much the whole OS directory tree is read-only to it, there's simply no place it could put data that survives from one service invocation to the next.

With systemd 235 this limitation is removed: there are now three new settings: StateDirectory=, LogsDirectory= and CacheDirectory=. In many ways they operate like RuntimeDirectory=, but create sub-directories below /var/lib, /var/log and /var/cache, respectively. There's one major difference beyond that however: directories created that way are persistent, they will survive the run-time cycle of a service, and thus may be used to store data that is supposed to stay around between invocations of the service.
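For instance, a dynamic user service that keeps persistent state, logs and a cache around between invocations might look something like this (again with a placeholder binary):

[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/foobar-daemon
DynamicUser=yes
# Persistent directories that survive both service and user:
# /var/lib/foobar, /var/log/foobar and /var/cache/foobar
StateDirectory=foobar
LogsDirectory=foobar
CacheDirectory=foobar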

Of course, the obvious question to ask now is: how do these three settings deal with the sticky file ownership problem?

For that we lifted a concept from container managers. Container managers have a very similar problem: each container and the host typically end up using a very similar set of numeric UIDs, and unless user name-spacing is deployed this means that host users might be able to access the data of specific containers that also have a user by the same numeric UID assigned, even though it actually refers to a very different identity in a different context. (Actually, it's even worse than just getting access, due to the existence of setuid file bits, access might translate to privilege elevation.) The way container managers protect the container images from the host (and from each other to some level) is by placing the container trees below a boundary directory, with very restrictive access modes and ownership (0700 and root:root or so). A host user hence cannot take advantage of the files/directories of a container user of the same UID inside of a local container tree, simply because the boundary directory makes it impossible to even reference files in it. After all on UNIX, in order to get access to a specific path you need access to every single component of it.

How is that applied to dynamic user services? Let's say StateDirectory=foobar is set for a service that has DynamicUser= turned off. The instant the service is started, /var/lib/foobar is created as state directory, owned by the service's user and remains in existence when the service is stopped. If the same service now is run with DynamicUser= turned on, the implementation is slightly altered. Instead of a directory /var/lib/foobar a symbolic link by the same path is created (owned by root), pointing to /var/lib/private/foobar (the latter being owned by the service's dynamic user). The /var/lib/private directory is created as boundary directory: it's owned by root:root, and has a restrictive access mode of 0700. Both the symlink and the service's state directory will survive the service's life-cycle, but the state directory will remain, and continues to be owned by the now disposed dynamic UID — however it is protected from other host users (and other services which might get the same dynamic UID assigned due to UID recycling) by the boundary directory.

The obvious question to ask now is: but if the boundary directory prohibits access to the directory from unprivileged processes, how can the service itself, which runs under its own dynamic UID, access it anyway? This is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system (modulo /tmp and /var/tmp as mentioned above), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. Inside of this tmpfs file system instance another mount is placed: a bind mount to the host's real /var/lib/private/foobar directory, onto the same name. Putting this together, this means that superficially everything looks the same and is available at the same place on the host and from inside the service, but two important changes have been made: the /var/lib/private boundary directory lost its restrictive character inside the service, and has been emptied of the state directories of any other service, thus making the protection complete. Note that the symlink /var/lib/foobar hides the fact that the boundary directory is used (making it little more than an implementation detail), as the directory is available this way under the same name as it would be if DynamicUser= was not used. Long story short: for the daemon, and from the view of the host, the indirection through /var/lib/private is mostly transparent.

This logic of course raises another question: what happens to the state directory if a dynamic user service is started with a state directory configured, gets UID X assigned on this first invocation, then terminates and is restarted and now gets UID Y assigned on the second invocation, with X ≠ Y? On the second invocation the directory — and all the files and directories below it — will still be owned by the original UID X so how could the second instance running as Y access it? Our way out is simple: systemd will recursively change the ownership of the directory and everything contained within it to UID Y before invoking the service's executable.

Of course, such recursive ownership changing (chown()ing) of whole directory trees can become expensive (though according to my experiences, IRL and for most services it's much cheaper than you might think), hence in order to optimize behavior in this regard, the allocation of dynamic UIDs has been tweaked in two ways to avoid the necessity to do this expensive operation in most cases: firstly, when a dynamic UID is allocated for a service an allocation loop is employed that starts out with a UID hashed from the service's name. This means a service by the same name is likely to always use the same numeric UID. That means that a stable service name translates into a stable dynamic UID, and that means recursive file ownership adjustments can be skipped (of course, after validation). Secondly, if the configured state directory already exists, and is owned by a suitable currently unused dynamic UID, it's preferably used above everything else, thus maximizing the chance we can avoid the chown()ing. (That all said, ultimately we have to face it, the currently available UID space of 4K+ is very small still, and conflicts are pretty likely sooner or later, thus a chown()ing has to be expected every now and then when this feature is used extensively).

Note that CacheDirectory= and LogsDirectory= work very similarly to StateDirectory=. The only difference is that they manage directories below /var/cache and /var/log, and their boundary directories hence are /var/cache/private and /var/log/private, respectively.

Examples

So, after all this introduction, let's have a look how this all can be put together. Here's a trivial example:

# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user

In this example, we create a unit file with DynamicUser= turned on, start it, check if it's running correctly, have a look at the service process' user (which is named like the service; systemd does this automatically if the service name is suitable as user name, and you didn't configure any user name to use explicitly), stop the service and verify that the user ceased to exist too.

That's already pretty cool. Let's step it up a notch, by doing the same in an interactive transient service (for those who don't know systemd well: a transient service is a service that is defined and started dynamically at run-time, for example via the systemd-run command from the shell. Think: run a service without having to write a unit file first):

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello

The above invokes an interactive shell as transient service run-u15750.service (systemd-run picked that name automatically, since we didn't specify anything explicitly) with a dynamic user whose name is derived automatically from the service name. Because StateDirectory=wuff is used, a persistent state directory for the service is made available as /var/lib/wuff. In the interactive shell running inside the service, the ls commands show the /var/lib/private boundary directory and its contents, as well as the symlink that is placed for the service. Finally, before exiting the shell, a file is created in the state directory. Back in the original command shell we check if the user is still allocated: it is not, of course, since the service ceased to exist when we exited the shell, and with it the dynamic user associated with it. From the host we check the state directory of the service, with similar commands as we used from inside of it. We see that things are set up pretty much the same way in both cases, except for two things: first of all, the user/group of the files is now shown as raw numeric UIDs instead of the user/group names derived from the unit name. That's because the user ceased to exist at this point, and "ls" shows the raw UID for files owned by users that don't exist. Secondly, the access mode of the boundary directory is different: when we look at it from outside of the service it is not readable by anyone but root, while when we looked at it from inside we saw it being world-readable.

Now, let's see how things look if we start another transient service, reusing the state directory from the first invocation:

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit

Here, systemd-run picked a different auto-generated unit name, but the dynamic UID used is still the same, as it was read from the pre-existing state directory and was otherwise unused. As we can see, the test file we generated earlier is accessible and still contains the data we left in there. Do note that the user name is different this time (as it is derived from the unit name, which is different), but the UID assigned to it is the same one as on the first invocation. We can thus see that the optimization of the UID allocation logic mentioned earlier (i.e. that the owner of a pre-existing state directory is preferred) took effect, so that no recursive chown()ing was required.

And that's the end of our example, which hopefully illustrated a bit how this concept and implementation works.

Use-cases

Now that we have had a look at how to enable this logic for a unit and how it is implemented, let's discuss where it could actually be useful in real life.

  • One major benefit of dynamic user IDs is that running a privilege-separated service leaves no artifacts in the system. A system user is allocated and made use of, but it is discarded automatically after use, in a safe and secure way that permits later recycling of the UID. Thus, quickly invoking a short-lived service for processing some job can be protected properly through a user ID without having to pre-allocate it and without draining the available UID pool any longer than necessary.

  • In many cases, starting a service no longer requires package-specific preparation. Or in other words, quite often useradd/mkdir/chown/chmod invocations in "post-inst" package scripts, as well as sysusers.d and tmpfiles.d drop-ins become unnecessary, as the DynamicUser= and StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the necessary work automatically, on-demand and with a well-defined life-cycle.
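
    For example, a unit for a hypothetical daemon (the binary name and directory names are invented for illustration) could declare everything it needs right in the unit file, with no post-inst script or drop-ins required:

    [Service]
    # fooserviced is a hypothetical daemon binary
    ExecStart=/usr/bin/fooserviced
    DynamicUser=yes
    StateDirectory=foo
    CacheDirectory=foo
    LogsDirectory=foo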

  • By combining dynamic user IDs with the transient unit concept, new creative ways of sand-boxing are made available. For example, let's say you don't trust the correct implementation of the sort command. You can now lock it into a simple, robust, dynamic UID sandbox with a simple systemd-run and still integrate it into a shell pipeline like any other command. Here's an example, showcasing a shell pipeline whose middle element runs as an on-the-fly allocated dynamic UID that is released when the pipeline ends:

    # cat some-file.txt | systemd-run --pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
    
  • By combining dynamic user IDs with the systemd templating logic it is now possible to do much more fine-grained and fully automatic UID management. For example, let's say you have a template unit file /etc/systemd/system/foobard@.service:

    [Service]
    ExecStart=/usr/bin/myfoobarserviced
    DynamicUser=1
    StateDirectory=foobar/%i
    

    Now, let's say you want to start one instance of this service for each of your customers. All you need to do now for that is:

    # systemctl enable foobard@customerxyz.service --now
    

    And you are done. (Invoke this as many times as you like, each time replacing customerxyz by some customer identifier, you get the idea.)

  • By combining dynamic user IDs with socket activation you may easily implement a system where each incoming connection is served by a process instance running as a different, fresh, newly allocated UID within its own sandbox. Here's an example waldo.socket:

    [Socket]
    ListenStream=2048
    Accept=yes
    

    With a matching waldo@.service:

    [Service]
    ExecStart=-/usr/bin/myservicebinary
    DynamicUser=yes
    

    With the two unit files above, systemd will listen on TCP/IP port 2048, and for each incoming connection invoke a fresh instance of waldo@.service, each time utilizing a different, new, dynamically allocated UID, neatly isolated from any other instance.

  • Dynamic user IDs combine very well with state-less systems, i.e. systems that come up with an unpopulated /etc and /var. A service using dynamic user IDs and the StateDirectory=, CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts will implicitly allocate the users and directories it needs for running, right at the moment where it needs it.

Dynamic users are a very generic concept, hence a multitude of other uses are conceivable; the list above is just supposed to trigger your imagination.

What does this mean for you as a packager?

I am pretty sure that a large number of services shipped with today's distributions could benefit from using DynamicUser= and StateDirectory= (and related settings). It often allows the removal of post-inst packaging scripts altogether, as well as of any sysusers.d and tmpfiles.d drop-ins, by unifying the needed declarations in the unit file itself. Hence, as a packager, please consider switching your unit files over. That said, there are a number of conditions under which DynamicUser= and StateDirectory= (and friends) cannot or should not be used. To name a few:

  1. Services that need to write to files outside of /run/<package>, /var/lib/<package>, /var/cache/<package>, /var/log/<package>, /var/tmp, /tmp, /dev/shm are generally incompatible with this scheme. This rules out daemons that upgrade the system, as one example, as that involves writing to /usr.

  2. Services that maintain a herd of processes with different user IDs. Some SMTP services are like this. If your service has such a super-server design, UID management needs to be done by the super-server itself, which rules out systemd doing its dynamic UID magic for it.

  3. Services which run as root (obviously…) or are otherwise privileged.

  4. Services that need to live in the same mount name-space as the host system (for example, because they want to establish mount points visible system-wide). As mentioned, DynamicUser= implies ProtectSystem=, PrivateTmp= and related options, which all require the service to run in its own mount name-space.

  5. Your focus is older distributions, i.e. distributions that do not yet have systemd 232 (for DynamicUser=) or systemd 235 (for StateDirectory= and friends).

  6. If your distribution's packaging guides don't allow it. Consult your packaging guides, and possibly start a discussion on your distribution's mailing list about this.

Notes

A couple of additional, random notes about the implementation and use of these features:

  1. Do note that allocating or deallocating a dynamic user leaves /etc/passwd untouched. A dynamic user is added into the user database through the glibc NSS module nss-systemd, and this information never hits the disk.

  2. On traditional UNIX systems it was the job of the daemon process itself to drop privileges, while the DynamicUser= concept is designed around the service manager (i.e. systemd) being responsible for that. That said, since v235 there's a way to marry DynamicUser= and such services which want to drop privileges on their own. For that, turn on DynamicUser= and set User= to the user name the service wants to setuid() to. This has the effect that systemd will allocate the dynamic user under the specified name when the service is started. Then, prefix the command line you specify in ExecStart= with a single ! character. If you do, the user is allocated for the service, but the daemon binary is invoked as root instead of the allocated user, under the assumption that the daemon changes its UID on its own the right way. Note that after registration the user will show up instantly in the user database, and is hence resolvable like any other by the daemon process. Example: ExecStart=!/usr/bin/mydaemond
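
    Put together in a unit file, this pattern could look like the following sketch (the daemon binary is the hypothetical mydaemond from the example above, and the user name is equally made up):

    [Service]
    # mydaemond is assumed to setuid() to 'mydaemon' on its own
    DynamicUser=yes
    User=mydaemon
    ExecStart=!/usr/bin/mydaemond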

  3. You may wonder why systemd uses the UID range 61184–65519 for its dynamic user allocations (side note: in hexadecimal this reads as 0xEF00–0xFFEF). That's because distributions (specifically Fedora) tend to allocate regular users from below the 60000 range, and we don't want to step into that. We also want to stay away from 65535 and a bit around it, as some of these UIDs have special meanings (65535 is often used as a special value for "invalid" or "no" UID, as it is identical to the 16bit value -1; 65534 is generally mapped to the "nobody" user, and is where some kernel subsystems map unmappable UIDs). Finally, we want to stay within the 16bit range. In a user name-spacing world each container tends to have much less than the full 32bit UID range available that Linux kernels theoretically provide. Everybody apparently can agree that a container should at least cover the 16bit range though — already to include a nobody user. (And quite frankly, I am pretty sure assigning 64K UIDs per container is nicely systematic, as the higher 16bit of the 32bit UID values this way become a container ID, while the lower 16bit become the logical UID within each container; see the small sketch below, if you still follow what I am babbling here…). And before you ask: no, this range cannot be changed right now, it's compiled in. We might change that eventually however.
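
    If the bit arithmetic helps, here is that container ID idea as a tiny C illustration (this is just the scheme expressed in code, not anything from systemd or the kernel):

    #include <stdint.h>

    static inline uint32_t container_id(uint32_t uid)     { return uid >> 16; }    /* upper 16bit: which container */
    static inline uint32_t uid_in_container(uint32_t uid) { return uid & 0xFFFF; } /* lower 16bit: UID within it */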

  4. You might wonder what happens if you already used UIDs from the 61184–65519 range on your system for other purposes. systemd should handle that mostly fine, as long as that usage is properly registered in the user database: when allocating a dynamic user we pick a UID, see if it is currently used somehow, and if yes pick a different one, until we find a free one. Whether a UID is used right now or not is checked through NSS calls. Moreover the IPC object lists are checked to see if there are any objects owned by the UID we are about to pick. This means systemd will avoid using UIDs you have assigned otherwise. Note however that this of course makes the pool of available UIDs smaller, and in the worst cases this means that allocating a dynamic user might fail because there simply are no unused UIDs in the range.

  5. If not specified otherwise the name for a dynamically allocated user is derived from the service name. Not everything that's valid in a service name is valid in a user-name however, and in some cases a randomized name is used instead to deal with this. Often it makes sense to pick the user names to register explicitly. For that use User= and choose whatever you like.

  6. If you pick a user name with User= and combine it with DynamicUser= and the user already exists statically it will be used for the service and the dynamic user logic is automatically disabled. This permits automatic up- and downgrades between static and dynamic UIDs. For example, it provides a nice way to move a system from static to dynamic UIDs in a compatible way: as long as you select the same User= value before and after switching DynamicUser= on, the service will continue to use the statically allocated user if it exists, and only operates in the dynamic mode if it does not. This is useful for other cases as well, for example to adapt a service that normally would use a dynamic user to concepts that require statically assigned UIDs, for example to marry classic UID-based file system quota with such services.

  7. systemd always allocates a pair of dynamic UID and GID at the same time, with the same numeric ID.

  8. If the Linux kernel had a "shiftfs" or similar functionality, i.e. a way to mount an existing directory to a second place, but map the exposed UIDs/GIDs in some way configurable at mount time, this would be excellent for the implementation of StateDirectory= in conjunction with DynamicUser=. It would make the recursive chown()ing step unnecessary, as the host version of the state directory could simply be mounted into the service's mount name-space, with a shift applied that maps the directory's owner to the service's UID/GID. But I don't have high hopes in this regard, as all work being done in this area appears to be bound to user name-spacing — which is a concept not used here (and I guess one could say user name-spacing is probably more a source of problems than a solution to one, but you are welcome to disagree on that).

And that's all for now. Enjoy your dynamic users!

October 04, 2017
If you take a look at the conformant Vulkan list, you might see entry 220.

Software in the Public Interest, Inc. 2017-10-04 Vulkan_1_0 220
AMD Radeon R9 285 Intel i5-4460 x86_64 Linux 4.13 X.org DRI3.

This is radv, and this is the first conformance submission done under the X.org (SPI) membership of the Khronos adopter program.

This submission was a bit of a trial run for the radv developers; it used Mesa 17.2 + LLVM 5.0 on Bas's R9 285 card.

We can extend this submission to cover all VI GPUs.

In practice we pass all the same tests on CIK and Polaris GPUs, but we will have to do complete submission runs on those when we get a chance.

But a major milestone/rubberstamp has been reached: radv is now a conformant Vulkan driver. Thanks go to Bas and all the other contributors and people whose code we've leveraged!
October 02, 2017

In the previous post in this series I introduced how to render the shadow map image, which is simply the depth information for the scene from the viewpoint of the light. In this post I will cover how to use the shadow map to render shadows.

The general idea is that for each fragment we produce we compute the light space position of the fragment. In this space, the Z component tells us the depth of the fragment from the perspective of the light source. The next step is to compare this value with the shadow map value for that same X,Y position. If the fragment’s light space Z is larger than the value we read from the shadow map, then the fragment is behind an object that is closer to the light, and therefore we can say that it is in the shadows; otherwise we know it receives direct light.

Changes in the shader code

Let’s have a look at the vertex shader changes required for this:

void main()
{
   vec4 pos = vec4(in_position.x, in_position.y, in_position.z, 1.0);
   out_world_pos = Model * pos;
   gl_Position = Projection * View * out_world_pos;

   [...]

   out_light_space_pos = LightViewProjection * out_world_pos;
} 

The vertex shader code above only shows the code relevant to the shadow mapping technique. Model is the model matrix with the spatial transforms for the vertex we are rendering, View and Projection represent the camera’s view and projection matrices and the LightViewProjection represents the product of the light’s view and projection matrices. The variables prefixed with ‘out’ represent vertex shader outputs to the fragment shader.

The code generates the world space position of the vertex (out_world_pos) and the clip space position (gl_Position) as usual, but it also computes the light space position of the vertex (out_light_space_pos) by applying the light’s View and Projection transforms to the world space position of the vertex. This will be used in the fragment shader to sample the shadow map.

The fragment shader will need to:

  1. Apply perspective division to compute NDC coordinates from the interpolated light space position of the fragment. Notice that this process is slightly different between OpenGL and Vulkan, since Vulkan’s NDC Z is expected to be in the range [0, 1] instead of OpenGL’s [-1, 1].

  2. Transform the X,Y coordinates from NDC space [-1, 1] to texture space [0, 1].

  3. Sample the shadow map and compare the result with the light space Z position we computed for this fragment to decide if the fragment is shadowed.

The implementation would look something like this:

float
compute_shadow_factor(vec4 light_space_pos, sampler2D shadow_map)
{
   // Convert light space position to NDC
   vec3 light_space_ndc = light_space_pos.xyz /= light_space_pos.w;

   // If the fragment is outside the light's projection then it is outside
   // the light's influence, which means it is in the shadow (notice that
   // such a sample would be outside the shadow map image)
   if (abs(light_space_ndc.x) > 1.0 ||
       abs(light_space_ndc.y) > 1.0 ||
       abs(light_space_ndc.z) > 1.0)
      return 0.0;

   // Translate from NDC to shadow map space (Vulkan's Z is already in [0..1])
   vec2 shadow_map_coord = light_space_ndc.xy * 0.5 + 0.5;

   // Check if the sample is in the light or in the shadow
   if (light_space_ndc.z > texture(shadow_map, shadow_map_coord.xy).x)
      return 0.0; // In the shadow

   // In the light
   return 1.0;
}  

The function returns 0.0 if the fragment is in the shadows and 1.0 otherwise. Note that the function also avoids sampling the shadow map for fragments that are outside the light’s frustum (and therefore are not recorded in the shadow map texture): we know that any fragment in this situation is shadowed because it is obviously not visible from the light. This assumption is valid for spotlights and point lights because in these cases the shadow map captures the entire influence area of the light source. For directional lights that affect the entire scene, however, we usually need to limit the light’s frustum to the surroundings of the camera, and in that case we probably want to consider fragments outside the frustum as lit instead.

Now all that remains in the shader code is to use this factor to eliminate the diffuse and specular components for fragments that are in the shadows. To achieve this we can simply multiply these components by the factor computed by this function.
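
For instance, the final color computation at the end of the fragment shader could look something like the sketch below. The variable names follow this series’ conventions but are otherwise assumptions: in_light_space_pos is the input matching the vertex shader’s out_light_space_pos output, and ambient, diffuse and specular are presumed to have been computed by the usual lighting equations earlier in the shader:

float shadow_factor =
   compute_shadow_factor(in_light_space_pos, shadow_map);

// Only the direct lighting terms are attenuated by the shadow factor;
// the ambient term is not affected by the shadow map.
out_color = vec4(ambient + shadow_factor * (diffuse + specular), 1.0);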

Changes in the program

The list of changes in the main program is straightforward: we only need to update the pipeline layout and descriptors to attach the new resources required by the shaders; specifically, the light’s view-projection matrix in the vertex shader (which could be bound as a push constant or a uniform buffer, for example) and the shadow map sampler in the fragment shader.

Binding the light’s ViewProjection matrix is no different from binding the other matrices we need in the shaders, so I won’t cover it here. The shadow map sampler doesn’t hold any mysteries either, but since it is new, let’s have a look at the code:

...
VkSampler sampler;
VkSamplerCreateInfo sampler_info = {};
sampler_info.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
sampler_info.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.anisotropyEnable = false;
sampler_info.maxAnisotropy = 1.0f;
sampler_info.borderColor = VK_BORDER_COLOR_INT_OPAQUE_BLACK;
sampler_info.unnormalizedCoordinates = false;
sampler_info.compareEnable = false;
sampler_info.compareOp = VK_COMPARE_OP_ALWAYS;
sampler_info.magFilter = VK_FILTER_LINEAR;
sampler_info.minFilter = VK_FILTER_LINEAR;
sampler_info.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
sampler_info.mipLodBias = 0.0f;
sampler_info.minLod = 0.0f;
sampler_info.maxLod = 100.0f;

VkResult result =
   vkCreateSampler(device, &sampler_info, NULL, &sampler);
...

This creates the sampler object that we will use to sample the shadow map image. The address mode fields are not very relevant since our shader ensures that we never attempt to sample outside the shadow map. We use linear filtering, though that is not mandatory of course, and we select nearest for the mipmap filter because we don’t have more than one mip level in the shadow map.

Next we have to bind this sampler to the actual shadow map image. As usual in Vulkan, we do this with a descriptor update. For that we need to create a descriptor of type VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, and then do the update like this:

VkDescriptorImageInfo image_info;
image_info.sampler = sampler;
image_info.imageView = shadow_map_view;
image_info.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

VkWriteDescriptorSet writes;
writes.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
writes.pNext = NULL;
writes.dstSet = image_descriptor_set;
writes.dstBinding = 0;
writes.dstArrayElement = 0;
writes.descriptorCount = 1;
writes.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
writes.pBufferInfo = NULL;
writes.pImageInfo = &image_info;
writes.pTexelBufferView = NULL;

vkUpdateDescriptorSets(ctx->device, 1, &writes, 0, NULL);

A combined image sampler brings together the texture image to sample from (a VkImageView of the image actually) and the description of the filtering we want to use to sample that image (a VkSampler). As with all descriptor sets, we need to indicate its binding point in the set (in our case it is 0 because we have a separate descriptor set layout for this that only contains one binding for the combined image sampler).
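
For completeness, the descriptor set layout behind that descriptor set could be created as in the following sketch (error handling is omitted and the variable names are just illustrative):

VkDescriptorSetLayoutBinding binding = {};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount = 1;
binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
binding.pImmutableSamplers = NULL;

VkDescriptorSetLayoutCreateInfo layout_info = {};
layout_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layout_info.bindingCount = 1;
layout_info.pBindings = &binding;

VkDescriptorSetLayout image_set_layout;
VkResult result =
   vkCreateDescriptorSetLayout(device, &layout_info, NULL, &image_set_layout);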

Notice that we need to specify the layout the image will be in when it is sampled from the shaders, which needs to be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. If you revisit the definition of our render pass for the shadow map image, you’ll see that we had it automatically transition the shadow map to this layout at the end of the render pass, so we know the shadow map image will be in this layout immediately after it has been rendered and we don’t need to add barriers to execute the layout transition manually.

So that’s it: with this we have all the pieces, and our scene should be rendering shadows now. Unfortunately, we are not quite done yet; if you look at the results, you will notice a lot of dark noise on surfaces that are directly lit. This is an artifact of shadow mapping called self-shadowing or shadow acne. The next section explains how to get rid of it.

Self-shadowing artifacts

Eliminating self-shadowing

Self-shadowing can happen for fragments on surfaces that are directly lit by a light source for which we are producing a shadow map. The reason is that for these fragments, the Z coordinate in light space should exactly match the value we read from the shadow map for the same X,Y coordinates. In other words, for these fragments we expect:

light_space_ndc.z == texture(shadow_map, shadow_map_coord.xy).x.

However, due to precision errors that can occur on both sides of that equation, we may end up with slightly different values for each side, and when the value we produce for light_space_ndc.z ends up being larger than what we read from the shadow map, even if only by a very small amount, the pixel will be marked as shadowed, leading to the result we see in that image.

The usual way to fix this problem involves adding a small depth offset, or bias, to the depth values we store in the shadow map, so that we always read a slightly larger value from the shadow map than the one we compute for the fragment. Another way to think about this is that when we record the shadow map, we push every object in the scene slightly away from the light source. Unfortunately, this depth bias should not be a constant value, since the angle between the surface normal and the vector from the light source to the fragment also affects the bias value needed to correct the self-shadowing.

Thankfully, GPU hardware provides means to account for this. In Vulkan, when we define the rasterization state of the pipeline we use to create the shadow map, we can add the following:

VkPipelineRasterizationStateCreateInfo rs;
...
rs.depthBiasEnable = VK_TRUE;
rs.depthBiasConstantFactor = 4.0f;
rs.depthBiasSlopeFactor = 1.5f;

Where depthBiasConstantFactor is a constant factor that is automatically added to all depth values produced, and depthBiasSlopeFactor is a factor used to compute depth offsets based on the depth slope of the primitive (roughly, the final bias is depthBiasSlopeFactor * m + depthBiasConstantFactor * r, where m is the maximum depth slope of the primitive and r is the minimum resolvable difference of the depth buffer format). This provides us with the means we need without having to do any extra work in the shaders ourselves to offset the depth values correctly. In OpenGL the same functionality is available via glPolygonOffset().

Notice that the bias values that produce the best results can change from scene to scene. Also, notice that values that are too large can lead to shadows that are “detached” from the objects that cast them, leading to very unrealistic results. This effect is also known as Peter Panning, and can be observed in this image:

Peter Panning artifacts

As we can see in the image, we no longer have self-shadowing, but now we have the opposite problem: the shadows cast by the red and blue blocks are visibly incorrect, as if the blocks were rendered further away from the light source than they actually are.

If the bias values are chosen carefully we should be able to get a good result, although sometimes we might need to accept some level of visible self-shadowing or visible Peter Panning:

Correct shadowing

The image above shows correct shadowing without any self-shadowing or visible Peter Panning. You may wonder why we can’t see some of the shadows from the red light on the floor where the green light is more intense. The reason is that the green light is mostly pointing down (this may not be obvious, since I don’t actually render the objects that project the lights), so its reflection on the floor (which has normals pointing upwards) is strong enough that the contribution from the red light to the floor pixels in this area is insignificant in comparison, making the shadows cast by the red light barely visible. You can still see some shadowing if you get close enough with the camera though, I promise 😉

Shadow antialiasing

The images above show aliasing at the edges of the shadows. This happens because for each fragment we decide whether it is shadowed or not as a boolean decision, and we use that result to fully shadow or fully light the pixel, leading to aliasing:

Shadow aliasing

Another thing contributing to the aliasing effect is that a single pixel in the shadow map image can expand to multiple pixels in camera space. That can happen, for example, if the camera is looking at an area of the scene that is close to the camera but far away from the light source. In that case, the resolution of that area of the scene in the shadow map is low while it is high for the camera, meaning that we end up sampling the same shadow map texel to shadow a larger area of the scene as seen by the camera.

Increasing the resolution of the shadow map image will help with this, but it is not a very scalable solution and can quickly become prohibitive. Alternatively, we can implement something called Percentage-Closer Filtering (PCF) to produce antialiased shadows. The technique is simple: instead of sampling just one texel from the shadow map, we take multiple samples in its neighborhood and average the results to produce shadow factors that do not need to be exactly 1 or 0, but can be somewhere in between, producing smoother transitions for shadowed pixels at the shadow edges. The more samples we take, the smoother the shadow edges get, but do note that extra samples per pixel also come with a performance cost.

Smooth shadows with PCF

This is how we can update our compute_shadow_factor() function to add PCF:

float
compute_shadow_factor(vec4 light_space_pos,
                      sampler2D shadow_map,
                      uint shadow_map_size,
                      uint pcf_size)
{
   vec3 light_space_ndc = light_space_pos.xyz /= light_space_pos.w;

   if (abs(light_space_ndc.x) > 1.0 ||
       abs(light_space_ndc.y) > 1.0 ||
       abs(light_space_ndc.z) > 1.0)
      return 0.0;

   vec2 shadow_map_coord = light_space_ndc.xy * 0.5 + 0.5;

   // compute total number of samples to take from the shadow map
   int pcf_size_minus_1 = int(pcf_size - 1);
   float kernel_size = 2.0 * pcf_size_minus_1 + 1.0;
   float num_samples = kernel_size * kernel_size;

   // Counter for the shadow map samples not in the shadow
   float lighted_count = 0.0;

   // Take samples from the shadow map
   float shadow_map_texel_size = 1.0 / shadow_map_size;
   for (int x = -pcf_size_minus_1; x <= pcf_size_minus_1; x++)
   for (int y = -pcf_size_minus_1; y <= pcf_size_minus_1; y++) {
      // Compute coordinate for this PCF sample
      vec2 pcf_coord = shadow_map_coord + vec2(x, y) * shadow_map_texel_size;

      // Check if the sample is in light or in the shadow
      if (light_space_ndc.z <= texture(shadow_map, pcf_coord.xy).x)
         lighted_count += 1.0;
   }

   return lighted_count / num_samples;
}

We now have a loop where we go through the samples in the neighborhood of the texel and average their respective shadow factors. Notice that because we sample the shadow map in texture space [0, 1], we need to know the size of the shadow map image to properly compute the coordinates of the texels in its neighborhood, so the application needs to provide this for every shadow map.
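
Calling the updated function could then look like this (a sketch: the shadow map resolution of 1024 and the pcf_size of 2, which yields a 3x3 sampling kernel, are just example values the application would provide):

float shadow_factor =
   compute_shadow_factor(in_light_space_pos, shadow_map, 1024, 2);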

Conclusion

In this post we discussed how to use the shadow map image to produce shadows in the scene, as well as typical issues that can show up with the shadow mapping technique, such as self-shadowing and aliasing, and how to correct them. This will be the last post in this series; there is a lot more to cover about lighting and shadowing, such as Cascaded Shadow Maps (which I introduced briefly in this other post), but I hope this series provides enough material for anyone interested in the technique to use as a reference for implementing it.

While I was waiting for review feedback on two patches (NIR alpha testing and VC5’s core autotools changes to enable the driver), I had some fun filling out more of the GL driver to reduce the piglit failure rate.

I’ve implemented generic tile lists (which allow multicore rendering), fixed base vertex, added base instance, fixed Z updates with discards, added 16-bit depth buffers, fixed depth and stencil clears, fixed interpolation qualifiers on deprecated color inputs, and started fixing the transform feedback support.

On the VC4 front, Boris has been busy working on the MADVISE ioctl. As with other GL drivers, we have a userspace BO cache, so that you don’t have to page fault in new buffer objects every frame. However, if we call the new madvise ioctl when putting BOs in the cache, the kernel gets permission to free our buffers if it needs to (for example, when CMA has run out of space for a new X11 window you opened). He has submitted kernel code, userspace code, and tests, and has had some feedback so far.

Maxime has also started up on the vc4 project, working on getting the Chamelium test environment working on the Raspberry Pi’s HDMI output. I’m hoping Chamelium can help us stabilize HDMI in the long term, but for right now the goal is to use it to build tests for displaying VC4’s tiling formats (particularly the SAND tiling format from the media decode engine).

I also got some good news from Raspberry Pi: Dave Stevenson has VCSM successfully importing dma-bufs. This is a major step toward displaying or GL rendering from the media decode engine’s buffers.

Finally, I took the next step in merging my MESA_tile_raster_order extension: Submitting a pull request to Khronos to add the spec. Hopefully we’ll be able to merge the code for faster uncomposited X11 soon.

September 28, 2017

Last year, the AppStream specification gained proper support for adding metadata for fonts, after Richard Hughes had done some work on it years ago. We weren’t happy with how fonts were handled at that time, so we searched for better solutions, which is why this took a bit longer to be done. Last year, I implemented the final support for fonts in both appstream-generator (the metadata extractor used by Debian and a few others) and the AppStream specification. This blog post was sitting on my todo list as a draft for a long time, and I only just now managed to finish it, so sorry for announcing this so late. Fonts have already been available via AppStream for a year, and this post just sums up the status quo and some neat tricks if you want to write metainfo files for fonts. If you are following AppStream (or the Debian fonts list), you know everything already 🙂 .

Both Richard and I first tried to extract all the metadata needed to display fonts properly to users directly from the font files. This turned out to be very difficult, since font metadata is often wrong or incomplete, and certain desirable bits of metadata (like a longer description) are missing entirely. After messing around with different ways to solve this for days (after all, by extracting the data from font files directly we would have had hundreds of fonts immediately available in software centers), I came to the same conclusion as Richard: the best and easiest solution here is to mandate the availability of metainfo files per font.

Which brings me to the second issue: What is a font? Any person who knows about fonts will understand one font as one font face, e.g. “Lato Regular Italic” or “Lato Bold”. A user, however, will see the font family as a font, e.g. just “Lato” instead of all the font faces separated out. Since AppStream data is used primarily by software centers, we want something that is easy for users to understand. Hence, an AppStream “font” component really describes a font family or collection of fonts, instead of individual font faces. We do also want AppStream data to be useful for system components looking for a specific font, which is why font components will advertise the individual font face names they contain via a <provides/> tag. Naming fonts and making them identifiable is a whole other issue; I used a document from Adobe on font naming issues as a rough guideline while working on this.

How to write a good metainfo file for a font is best shown with an example. Lato is a good-looking font family that we want displayed in a software center. So, we write a metainfo file for it and place it in /usr/share/metainfo/com.latofonts.Lato.metainfo.xml for the AppStream metadata generator to pick up:

<?xml version="1.0" encoding="UTF-8"?>
<component type="font">
  <id>com.latofonts.Lato</id>
  <metadata_license>FSFAP</metadata_license>
  <project_license>OFL-1.1</project_license>

  <name>Lato</name>
  <summary>A sanserif typeface family</summary>
  <description>
    <p>
      Lato is a sanserif typeface family designed in the Summer 2010 by Warsaw-based designer
      Łukasz Dziedzic (“Lato” means “Summer” in Polish). In December 2010 the Lato family
      was published under the open-source Open Font License by his foundry tyPoland, with
      support from Google.
    </p>
  </description>

  <url type="homepage">http://www.latofonts.com/</url>

  <provides>
    <font>Lato Regular</font>
    <font>Lato Black Italic</font>
    <font>Lato Black</font>
    <font>Lato Bold Italic</font>
    <font>Lato Bold</font>
    <font>Lato Hairline Italic</font>
    ...
  </provides>
</component>

When the file is processed, we know that we need to look for fonts in the package it is contained in. So, the appstream-generator will load all the fonts in the package and render example texts for them as an image, so we can show users a preview of the font. It will also use heuristics to render an “icon” for the respective font component using its regular typeface. Of course that is not ideal – what if there are multiple font faces in a package? What if the heuristics fail to detect the right font face to display?

This behavior can be influenced by adding <font/> tags to a <provides/> tag in the metainfo file. The font-provides tags should contain the fullnames of the font faces you want to associate with this font component. If the font file does not define a fullname, the family and style are used instead. That way, someone writing the metainfo file can control which fonts belong to the described component. The metadata generator will also pick the first mentioned font name in the <provides/> list as the one to render the example icon for. It will also sort the example text images in the same order as the fonts are listed in the provides-tag.

The example lines of text are written in a language matching the font using Pango.

But what about symbolic fonts? Or fonts where any heuristic fails? At the moment, we see ugly tofu characters or boxes instead of an actual, useful representation of the font. This brings me to an unofficial extension to font metainfo files that, as far as I know, only appstream-generator supports at the moment. I am not happy enough with this solution to add it to the real specification, but it serves as a good method to fix up the edge cases where we cannot render good example images for fonts. appstream-generator supports the FontIconText and FontSampleText custom AppStream properties to allow metainfo file authors to override the default texts and autodetected values. FontIconText will override the characters used to render the icon, while FontSampleText can be a line of text used to render the example images. This is especially useful for symbolic fonts, where the heuristics usually fail and we do not know which glyphs would be representative for a font.

For example, a font with mathematical symbols might want to add the following to its metainfo file:

<custom>
  <value key="FontIconText">∑√</value>
  <value key="FontSampleText">∑ ∮ √ ‖...‖ ⊕ 𝔼 ℕ ⋉</value>
</custom>

Any Unicode glyphs are allowed, but asgen will put some length restrictions on the texts.

So, in summary:

  • Fonts are hard
  • I need to blog faster
  • Please add metainfo files to your fonts and submit them upstream if you can!
  • Fonts must have a metainfo file in order to show up in GNOME Software, KDE Discover, AppCenter, etc.
  • The “new” font specification is backwards compatible with Richard's pioneering work from 2014
  • The appstream-generator supports a few non-standard values to influence how font images are rendered that you might be interested in (maybe we can do something like that for appstream-builder as well)
  • The appstream-generator does not (yet?) support the <extends/> logic Richard outlined in his blog post, mainly because it wasn’t necessary in Debian/Ubuntu/Arch yet (which is asgen’s primary audience), and upstream projects would rarely want to write multiple metainfo files.
  • The metainfo files are not supposed to replace the existing fontconfig files, and sadly we can not generate them from existing metadata
  • If you want a more detailed look at writing font metainfo files, take a look at the AppStream specification.
  • Please write more font metadata 😉

 

September 26, 2017

The All Systems Go! 2017 schedule has been published!

I am happy to announce that we have published the All Systems Go! 2017 schedule! We are very happy with the large number and the quality of the submissions we got, and the resulting schedule is exceptionally strong.

Without further ado:

Here's the schedule for the first day (Saturday, 21st of October).

And here's the schedule for the second day (Sunday, 22nd of October).

Here are a couple of keywords from the topics of the talks: 1password, azure, bluetooth, build systems, casync, cgroups, cilium, cockpit, containers, ebpf, flatpak, habitat, IoT, kubernetes, landlock, meson, OCI, rkt, rust, secureboot, skydive, systemd, testing, tor, varlink, virtualization, wifi, and more.

Our speakers are from all across the industry: Chef, CoreOS, Covalent, Facebook, Google, Intel, Kinvolk, Microsoft, Mozilla, Pantheon, Pengutronix, Red Hat, SUSE and more.

For further information about All Systems Go! visit our conference web site.

Make sure to buy your ticket for All Systems Go! 2017 now! A limited number of tickets are left at this point, so make sure you get yours before we are all sold out! Find all details here.

See you in Berlin!

Last week I was at the X Developers Conference, where I gave a talk about the vc4 and vc5 projects (video).

Probably the best result of that talk was a developer mentioning to me that the AMDGPU kernel driver’s execbuf model resolved my concerns about the VC5 submit ioctl I was building: All private buffers are implicitly included in an exec, the kernel tracks a list of evicted buffers to be brought back in (probably empty), and it uses a single reservation object for all the non-shared buffers in the address space, so that there are no O(n) walks in the kernel!

As far as code goes, I’m continuing to make progress on the VC5 Vulkan port. So far this involves copying code from the Intel (“anv”) driver and trying to get all the symbols implemented, while replacing the core structures where I think I need to (batchbuffers and relocation lists turned into a set of command lists and referenced buffers, but without relocations). I’ve also picked a few bits out of the AMD (“radv”) driver, where they’ve done things I need that Intel didn’t have.

Next up in userspace is to get the VC5 code merged (I’m still blocked on review of the NIR alphatest pass, and the patch where I touch common build system stuff to hook the vc5 driver into it), get bcmv to the point where it links (buffer layout is the next big chunk of code to write), finish off state emission, and then start trying to get my first Vulkan triangle rendered.

September 21, 2017

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

Thanks

This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of XDC, and Google for hosting a great community event.

HUION PenTablet devices are graphics tablets aimed at artists. These tablets tend to target the lower end of the market, and driver support is often somewhere between meh and disappointing. The DIGImend project used to take care of them, but with that out of the picture, the bugs bubble up to userspace more often.

The most common bug at the moment is a lack of proximity events. On pen devices like graphics tablets, we expect a BTN_TOOL_PEN event whenever the pen goes in or out of the detectable range of the tablet ('proximity'). On most devices, proximity does not imply touching the surface (that's BTN_TOUCH or a pressure-based threshold), on anything that's not built into a screen proximity without touching the surface is required to position the cursor correctly. libinput relies on proximity events to provide the correct tool state, which again is relied upon by compositors and clients.

The broken HUION devices only send BTN_TOOL_PEN once whenever the pen first goes into proximity and then never again until the device is disconnected. To make things more fun, HUION re-uses USB ids, so we cannot even reliably detect the broken devices and do the usual approach to hardware-quirking. So far, libinput support for HUION devices has thus been spotty. The good news is that libinput git master (and thus libinput 1.9) will have a fix for this. The one thing we can rely on is that tablets keep sending events at the device's scanout frequency. So in libinput we now add a timeout to the tablets and assume proximity-out has happened. libinput fakes a proximity out event and waits for the next event from the tablet - at which point we'll fake a proximity in before processing the events. This is enabled on all HUION devices now (re-using USB IDs, remember?) but not on any other device.

One down, many more broken devices to go. Yay.

September 20, 2017
It's been a while since I posted about Fedora-specific Bluetooth enhancements, and even longer since I posted about PlayStation controller support.

Let's start with the nice feature.

Dual-Shock 3 and 4 support

We've had support for Dual-Shock 3 (aka Sixaxis, aka PlayStation 3 controllers) for a long while, but I've now added a long-standing patchset to the Fedora packages that changes the way devices are set up.

The old way was: plug in your joypad via USB, disconnect it, and press the "P" button on the pad. At this point, and since GNOME 3.12, you would have needed the Bluetooth Settings panel open for a question to pop up asking whether the joypad could connect.

This was broken in a number of ways. If you were trying to just charge the joypad, it would forget its original "console" and you would need to plug it in again. And if you didn't have the Bluetooth panel open when trying to use it wirelessly, it just wouldn't work.

Setup is now simpler. Open the Bluetooth panel, plug in your device, and answer the question. You just want to charge it? Dismiss the query, or simply don't open the Bluetooth panel; it'll work dandily and won't overwrite the joypad's settings.


And finally, we also made sure that it works with PlayStation 4 controllers.



Note that the PlayStation 4 controller has a button combination that makes it visible and pairable, except that if the device trying to connect to it doesn't behave in a particular way (probably the same way the 25€ RRP USB adapter does), it just won't work. And it didn't work for me on a number of different devices.

Cable pairing for the win!

And the boring stuff

Hey, do you know what happened last week? There was a security problem in a package that I glance at sideways sometimes! Yes. Again.

A good way to minimise the problems caused by problems like this one is to lock the program down. In much the same way that you'd want to restrict thumbnailers, or even end-user applications, we can forbid certain functionality from being available when launched via systemd.

We've finally done this in recent fprintd and iio-sensor-proxy upstream releases, as well as for bluez in Fedora Rawhide. If testing goes well, we will integrate this in Fedora 27.
September 19, 2017

In quite a few blog posts I have been referencing Pipewire, our new Linux infrastructure piece to handle multimedia under Linux better. Well, we are finally ready to formally launch Pipewire as a project and have created a Pipewire website and logo.

To give you all some background, Pipewire is the latest creation of GStreamer co-creator Wim Taymans. The original reason it was created was that we realized that as desktop applications moved towards primarily being shipped as containerized Flatpaks, we would need something for video similar to what PulseAudio was doing for audio. As part of his job here at Red Hat, Wim had already been contributing to PulseAudio for a while, including implementing a new security model for PulseAudio to ensure we could securely have containerized applications output sound through PulseAudio. So he set out to write Pipewire, although initially the name he used was PulseVideo. As he was working on figuring out the core design of Pipewire he came to the conclusion that designing it to only handle video would be a mistake, as a major challenge he was familiar with from working on GStreamer was how to ensure perfect audio and video synchronisation. If both audio and video could be routed through the same media daemon, ensuring audio and video worked well together would be a lot simpler, and frameworks such as GStreamer would need to do a lot less heavy lifting to make it work. So just before we started sharing the code publicly we renamed the project to Pinos, named after Pinos de Alhaurín, a small town close to where Wim lives in southern Spain. In retrospect Pinos was probably not the world's best name :)

Anyway as work progressed Wim decided to also take a look at Jack, as supporting the pro-audio usecase was an area PulseAudio had never tried to do, yet we felt that if we could ensure Pipewire supported the pro-audio usecase in addition to consumer level audio and video it would improve our multimedia infrastructure significantly and ensure pro-audio became a first class citizen on the Linux desktop. Of course as the scope grew the development time got longer too.

Another major usecase for Pipewire for us was that we knew that with the migration to Wayland we would need a new mechanism to handle screen capture, as the way it was done under X was very insecure. So Jonas Ådahl started working on creating an API we could support in the compositor and use Pipewire to output. This is meant to cover everything from single-frame capture (like screenshots) to local desktop recording and remoting protocols. It is important to note here that what we have done is not just implement support for a specific protocol like RDP or VNC; we ensured there is an advanced infrastructure in place to support any protocol on top of it. For instance, we will be working with the Spice team here at Red Hat to ensure SPICE can take advantage of Pipewire and the new API. We will also ensure Chrome and Firefox support this so that you can share your Wayland desktop through systems such as Blue Jeans.

Where we are now
So after multiple years of development we are now landing Pipewire in Fedora Workstation 27. This initial version is video only as that is the most urgent thing we need supported for Flatpaks and Wayland. So audio is completely unaffected by this for now and rolling that out will require quite a bit of work as we do not want to risk breaking audio on your system as a result of this change. We know that for many the original rollout of PulseAudio was painful and we do not want a repeat of that history.

So I strongly recommend grabbing the Fedora Workstation 27 beta to test pipewire and check out the new website at Pipewire.org and the initial documentation at the Pipewire wiki. Especially interesting is probably the pages that will eventually outline our plans for handling PulseAudio and JACK usecases.

If you are interested in Pipewire, please join us on IRC in #pipewire on freenode. Also, if things go as planned, Wim will be on Linux Unplugged tonight talking to Chris Fisher and the Unplugged crew about Pipewire, so tune in!

September 14, 2017
Introduction: If you were waiting for a new post to learn something new about Intel GEN graphics, I am sorry. I started moving away from i915 kernel development about one year ago. Although I had landed patches in Mesa before, it has taken me some time to get to a point where I had […]
My next project for Red Hat is to work on improving Linux laptop battery life. Part of the (hopefully) low-hanging fruit here is using kernel tunables to enable more runtime power management. My first target is SATA Link Power Management (LPM) which, as Matthew Garrett blogged about 2 years ago, can lead to a significant improvement in battery life.

There is only one small problem, there have been some reports that some disks/SSDs don't play well with Linux' min_power LPM policy and that this may lead to system crashes and even data corruption.

Let me repeat this: enabling SATA LPM may lead to DATA CORRUPTION. So if you want to help with testing, please make sure you have recent backups! Note that this happens only in rare cases (likely only with a couple of specific SSD models with buggy firmware), but still, DATA CORRUPTION may happen, so make sure you have BACKUPS.

As part of his efforts 2 years ago Matthew found this document which describes the LPM policy the Windows Intel Rapid Storage Technology (IRST) drivers use by default and most laptops ship with these drivers installed.

So, based on an old patch from Matthew, I've written a patch adding support for a new LPM policy called "med_power_with_dipm" to Linux. This saves (almost) as much power as the min_power setting, and since it matches the Windows defaults I hope that it won't trip over any SSD/HDD firmware bugs.

So this is where my call for testers comes in: for Fedora 28 we would like to switch to this new SATA LPM policy by default (on laptops at least), but we need to know that this is safe to do. So we are looking for people to help test this. If you have a laptop with a SATA drive (not NVMe) and would like to help, please make BACKUPS and then continue reading :)



First of all, on a clean Fedora (no powertop --auto-tune, no TLP) do "sudo dnf install powertop", then close all your apps except for one terminal, maximize that terminal and run "sudo powertop".

Now wait 5 minutes; on some laptops the power measurement is a moving average, so this is necessary to get a reliable reading. Then look at the power consumption shown (e.g. 7.95W) and watch it for a couple of refreshes, as it sometimes spikes when something wakes up to do some work. Write down the lowest value you see; this is the base value for your laptop's power consumption.

Next, install the new kernel and try the new SATA LPM policy. I've done a scratch-build of the Fedora kernel with this patch added, which you can download here.

After downloading all the .x86_64.rpm files there into a dir, do from this dir:
sudo rpm -ivh kernel*.x86_64.rpm

Next download a rc.local script applying the new settings from here, copy it to /etc/rc.d/rc.local, and make it executable: "sudo chmod +x /etc/rc.d/rc.local".
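
For reference, the essential part of such a script boils down to something like the following sketch (an illustration of the idea, not necessarily the exact script linked above; it applies the new policy to every SATA host):

#!/bin/sh
# Apply the new SATA LPM policy to all SATA hosts (requires the patched kernel).
for p in /sys/class/scsi_host/host*/link_power_management_policy; do
    echo med_power_with_dipm > "$p"
done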

Now reboot and run "cat /sys/class/scsi_host/host0/link_power_management_policy"; this should return med_power_with_dipm. If it does not, something is wrong.

Then close all your apps except for one terminal, maximize that terminal and run "sudo powertop" again. Wait 5 minutes as before, then take a couple of readings and write down the lowest value you see.

After this, continue using your laptop as normal, making sure that you keep running the special kernel with the patch adding the "med_power_with_dipm" policy. If after 2 weeks you've not noticed any bad side effects (or if you notice bad side effects earlier), send me a mail at hdegoede@redhat.com with:

  • Report of success or bad side effects

  • The idle power consumption before and after the changes

  • The brand and model of your laptop

  • The output of the following commands (see the combined snippet below):

      • cat /proc/cpuinfo | grep "model name"

      • cat /sys/class/scsi_device/*/device/model
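
For convenience, the hardware info can be gathered in one go with something like this (a sketch, not part of the official instructions):

grep "model name" /proc/cpuinfo | sort -u
cat /sys/class/scsi_device/*/device/model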

I will gather the results in a table which will be part of the to-be-created Fedora 28 Changes page for this.

Did I mention already that although the chance that something will go wrong is small, it is non-zero, and you should create backups?

Thank you for your time.
September 12, 2017

So our graphics team is looking for a new Senior Software Engineer to help with our AMD GPU support, including GPU compute. This is a great opportunity to join a versatile and top-notch development team that plays a crucial role in making sure Linux has up-to-date and working graphics support, and that is deeply involved with most major new developments in Linux graphics.

Also, a piece of advice when you read the job advertisement: remember that it is very rare that anyone can tick all the boxes in the requirement list, so don't hesitate to apply just because you don't fit the description and requirements perfectly. For example, even if you are more junior in terms of years, you could still be a great candidate if you for instance participated in GPU-related Google Summer of Code projects or contributed as a community member. And for this position we are open to candidates from around the globe interested in working as remotees, although as always, if you are willing or interested in joining one of our development offices in Boston (USA), Brisbane (Australia) or Brno (Czech Republic), that is of course a plus.

So please check out the job advertisement for Senior Software Engineer and see if it could be your chance to join the world's premier open source company.

September 10, 2017

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

Thanks

This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of Open Source Summit NA, for hosting a great community event.

September 09, 2017
I've been polishing a patch from Marek to enable out-of-order rasterization on VI+. Assuming it goes through as planned, this will be the first time we're adding driver-specific drirc configuration options that are unfamiliar to the enthusiast community (there's radeonsi_enable_sisched already, but Phoronix has reported on the sisched option often enough). So I thought it makes sense to explain what those options are about.

Background: Out-of-order rasterization

Out-of-order rasterization is an optimization that can be enabled in some cases. Understanding it properly requires some background on how tasks are spread across shader engines (SEs) on Radeon GPUs.

The frontends (vertex processing, including tessellation and geometry shaders) and backends (fragment processing, including rasterization and depth and color buffers) are spread across SEs roughly like this:

[Diagram: several shader engines, each with its own frontend and backend, connected by a "crossbar".]

(Not shown are the compute units (CUs) in each SE, which is where all shaders are actually executed.)

The input assembler distributes primitives (i.e., triangles) and their vertices across SEs in a mostly round-robin fashion for vertex processing. In the backend, work is distributed across SEs by on-screen location, because that improves cache locality.

This means that once the data of a triangle (vertex position and attributes) is complete, most likely the corresponding rasterization work needs to be distributed to other SEs. This is done by what I'm simplifying as the "crossbar" in the diagram.

OpenGL is very precise about the order in which the fixed-function parts of fragment processing should happen. If one triangle comes after another in a vertex buffer and they overlap, then the fragments of the second triangle better overwrite the corresponding fragments of the first triangle (if they weren't rejected by the depth test, of course). This means that the "crossbar" may have to delay forwarding primitives from a shader engine until all earlier primitives (which were processed in another shader engine) have been forwarded. This only happens rarely, but it's still sad when it does.

There are some cases in which the order of fragments doesn't matter. Depth pre-passes are a typical example: the order in which triangles are written to the depth buffer doesn't matter as long as the "front-most" fragments win in the end. Another example are some operations involved in stencil shadows.

Out-of-order rasterization simply means that the "crossbar" does not delay forwarding triangles. Triangles are instead forwarded immediately, which means that they can be rasterized out-of-order. With the in-progress patches, the driver recognizes cases where this optimization can be enabled safely.

By the way #1: From this explanation, you can immediately deduce that this feature only affects GPUs with multiple SEs. So integrated GPUs are not affected, for example.

By the way #2: Out-of-order rasterization is entirely disabled by setting R600_DEBUG=nooutoforder.
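
Like other radeonsi debug switches this is an environment variable set per invocation, e.g. (glxgears is just a stand-in for any GL application):

R600_DEBUG=nooutoforder glxgears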


Why the configuration options?

There are some cases where the order of fragments almost doesn't matter. It turns out that the most common and basic type of rendering is one of these cases: drawing triangles without blending, with a standard depth function like LEQUAL, and with depth writes enabled. Basically, this is what you learn to do in every first 3D programming tutorial.

In this case, the order of fragments is mostly irrelevant because of the depth test. However, it might happen that two triangles have the exact same depth value, and then the order matters. This is very unlikely in common scenes though. Setting the option radeonsi_assume_no_z_fights=true makes the driver assume that it indeed never happens, which means out-of-order rasterization can be enabled in the most common rendering mode!

Some other cases occur with blending. Some blending modes (though not the most common ones) are commutative in the sense that from a purely mathematical point of view, the end result of blending two triangles together is the same no matter which order they're blended in. Unfortunately, additive blending (which is one of those modes) involves floating point numbers in a way where changing the order of operations can lead to different rounding, which leads to subtly different results. Using out-of-order rasterization would break some of the guarantees the driver has to give for OpenGL conformance.

The option radeonsi_commutative_blend_add=true tells the driver that you don't care about these subtle errors and will lead to out-of-order rasterization being used in some additional cases (though again, those cases are rarer, and many games probably don't encounter them at all).

tl;dr

Out-of-order rasterization is enabled by default where it is safe, and can give a very minor boost in many games on multi-shader-engine VI+ GPUs (meaning dGPUs, basically). In most games, you should be able to set radeonsi_assume_no_z_fights=true and radeonsi_commutative_blend_add=true to get an additional very minor boost. Those options aren't enabled by default because they can lead to incorrect results.
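
To make this concrete, here is a sketch of how one could set both options user-wide via the usual drirc mechanism. This assumes the patches land with these option names unchanged and that the standard ~/.drirc format applies; the exact plumbing may differ:

cat > ~/.drirc <<'EOF'
<driconf>
    <device driver="radeonsi">
        <application name="Default">
            <option name="radeonsi_assume_no_z_fights" value="true" />
            <option name="radeonsi_commutative_blend_add" value="true" />
        </application>
    </device>
</driconf>
EOF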
September 05, 2017

This week I worked on stabilizing the VC5 DRM. I got the MMU support almost working – rendering triangles, but eventually crashing the kernel, and then decided to simplify my plan a bit. First, a bit of background on DRM memory management:

One of the things I’m excited about for VC5 is having each process have a separate address space. It has always felt wrong to me on older desktop chips that we would let any DRI client read/write the contents of other DRI clients’ memory. We just didn’t have any hardware support to help us protect them, without effectively copying out and zeroing the other clients’ on-GPU memory. With i915 we gained page tables that we could swap out, at least, but we didn’t improve the DRM to do this for a long time (I’m actually not entirely sure if we even do so now). One of our concerns was that it could be a cost increase when switching between clients.

However, one of the benefits of having each process have a separate address space is that then we can give the client addresses for its buffer that it can always rely on. Instead of each reference to a buffer needing to be tracked so that kernel (or userspace, with new i915 ABI) can update them when the buffer address changes, we just keep track of the list of all buffers that have been referenced and need to be at their given offsets. This should be a win for CPU overhead.

What I built at first was each VC5 DRM fd having its own page table and address space that it would bind buffers into. The problem was that the page tables are expensive (up to 4MB of contiguous memory), and so I’d need to keep the page tables separate from the address spaces so that I could allocate them on demand.

After a bit of hacking on that plan, I decided to simplify things for now. Given that today’s boards have less than 4 GB of memory, I could have all the clients share the same page tables for a 4GB address space and use the GMP (an 8kb bitmask for 128kb-granularity access control) to mask each client’s access to the address space. With a shared page table and the GMP disabled I soon got this stable enough to do basic piglit texturing tests reliably. The GMP plan should also reduce our context switch overhead, by replacing the small 8kb GMP memory instead of flushing all the MMU caches.
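
As a back-of-the-envelope check, the numbers line up if you assume two access-control bits per 128kb region (my assumption, not something stated above):

# A 4GB address space split into 128kb regions:
echo $(( (4 << 30) / (128 << 10) ))   # 32768 regions
# At an assumed 2 bits of access control per region:
echo $(( 32768 * 2 / 8 ))             # 8192 bytes, i.e. the 8kb GMP bitmask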

Somewhere down the line, if we find we need 4GB per client, we can build kernel support to have clients that exhaust the shared table get kicked out to their own page tables.

Next up for the kernel module: GPU reset (so I can start running piglit tests and settling my V3D DT binding), and filling out those GMP bitmasks.

The last couple of days of the week were spent on forking the anv driver to start building a VC5 Vulkan driver (“bcmv”). I’ll talk about it soon (next week is my bike trip, so no TWIV then, and after that is XDC 2017).

August 31, 2017

I would like to thank the wonderful organizers, GDG Berlin Android, for hosting a great community event.

Downloads

If you're curious about the slides, you can download the PDF or the OTP.

August 29, 2017

The All Systems Go! 2017 Call for Participation is Closing on September 3rd!

Please make sure to get your presentation proposals for All Systems Go! 2017 in now! The CfP closes on Sunday!

In case you haven't heard about All Systems Go! yet, here's a quick reminder what kind of conference it is, and why you should attend and speak there:

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward. All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd. All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year; it has been superseded by All Systems Go!. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!

August 28, 2017

This week was fairly focused on getting my BCM7268 board up and running and building the kernel module for it. I first got the board booting to an NFS root on an upstream kernel – huge thanks to the STB team for the upstreaming work they’ve done, so that I could just TFTP load an upstream kernel and start working.

The second half of the week was building the DRM module. I started with copying from vc4, since our execution model is very similar. We have an MMU now, so I replaced the CMA helpers with normal shmem file allocation, and started building bits we’ll need to have a page table per DRM file descriptor. I also ripped out the display components – we’re 3D only now, and any display hardware would be in a separate kernel module and communicate through dma-bufs. Finally, I put together a register header, debugfs register decode, and part of the interrupt handlers. Next week it’s going to be getting our first interrupts handled and filling out the MMU, then I should be on to trying to actually paint some pixels!

August 27, 2017
Occasionally the question comes up about why we convert between various IRs (intermediate representations), like glsl to NIR, in the process of compiling a shader. Wouldn't it be faster if we just skipped a step and went straight from glsl to "the final thing", which would be ir3 (freedreno), codegen (nouveau), or LLVM (radeonsi/radv)? It is a reasonable question, since most people haven't worked on compilers, and we probably haven't done a good job at explaining all the various passes involved in compiling a shader or presenting a breakdown of where the time is spent.

So I spent a bit of time this morning with perf to profile a shader-db run (or rather a subset of a full run to keep the perf.data size manageable, see notes at end).
A flamegraph from the shader-db run, since every blog post needs a catchy picture.

Breakdown:

  • parser, into glsl: 9.98%
  • glsl to nir: 1.3%
  • nir opt/lowering passes: 21.4%
    • CSE: 6.9%
    • opt algebraic: 3.5%
    • conversion to SSA: 2.1%
    • DCE: 2.0%
    • copy propagation: 1.3%
    • other lowering passes: 5.6%
  • nir to ir3: 1.5%
  • ir3 passes:  21.5%
    • register allocation: 5.1%
    • sched: 14.3%
    • other: 2.1%
  • assembly (ir3->binary): 0.66%
This is ignoring some of the fixed overheads of the shader-db runner, and also doesn't capture a bunch of NIR lowering passes individually. NIR has ~40 lowering passes, some of them gl related, like nir_lower_draw_pixels and nir_lower_wpos_ytransform (because for hysterical reasons textures and therefore FBOs are upside down in gl). For gallium drivers using NIR, these gl specific passes are called from the mesa state-tracker.

The other lowering passes are not gl specific but tend to be specific to general GPU shader features (ie. things that you wouldn't find in a C compiler for a cpu) and things that are needed by multiple different drivers. Examples include nir_lower_tex, which handles sampling from YUV textures, ie. inserting the instructions to do YUV->RGB conversion (since GLES and android strongly assume this is a thing that hardware can always do), lowering RECT textures, and clamping texture coords. These lowering passes are called from the driver backend, so the driver is in control of which lowering passes are needed, including configuration of individual features in passes which handle multiple things, based on what the hardware does not support directly.

These lowering passes are mostly O(n), and lost in the noise.

Also note that freedreno, along with the other drivers that can consume NIR directly, disable a bunch of opt passes that were originally done in glsl, but that NIR (or LLVM) can do more efficiently.  For freedreno, disabling the glsl opt passes shaved ~30% runtime off of a shader-db run, so spending 1.3% to convert into NIR is way more than offset.

For other drivers, the breakdown may be different.  I expect radeonsi/radv skips some of the general opt passes in NIR which have a counterpart in LLVM, but re-uses other lowering passes which do not have a counterpart in LLVM.


Is it still a gallium driver?


This is a related question that comes up sometimes: is it a gallium driver if it doesn't use TGSI? Yes.

The drivers that can consume NIR and implement the gallium pipe driver interface, freedreno a3xx+, vc4, vc5, and radeonsi (optionally), are gallium drivers.  They still have to accept TGSI for state trackers which do not support NIR, and various built-in shaders (blits, mipmap generation, etc).  Most use the shared tgsi_to_nir pass for TGSI shaders.  Note that currently tgsi_to_nir does not support all the TGSI features, but just features needed by internal shaders, and what is needed for gl3/gles3 (ie. basically what freedreno and vc4 needed before mesa state-tracker grew support for glsl_to_nir).


Notes:


Collected from a shader-db run (glamor + supertuxkart + 0ad shaders) with a debug mesa build (to have debug syms and prevent inlining) but with NIR_VALIDATE=0 (otherwise results with debug builds are highly skewed). A subset of all shader-db shaders was used to keep the perf.data size manageable.
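
For the curious, collecting such a profile is straightforward to reproduce; a rough sketch, assuming a shader-db checkout with its usual run harness (paths illustrative):

export NIR_VALIDATE=0              # validation badly skews debug-build results
perf record -g ./run shaders/      # profile a (subset of a) shader-db run
perf report                        # browse the per-symbol breakdown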
August 25, 2017

In this last week of my GSoC project I aimed at bringing my code into its final form. The goals for that are simple:

  • Test the code, fix bugs and record any remaining issues.
  • Remove any coding style misdemeanors.
  • Since in the end my changes to Xwayland as well as to the Present code became pretty extensive, split them into multiple patches so that they are easier to review.

Regarding the remaining issues in my test scenarios: there are still some problems with some of my test applications. As long as the underlying problem isn't identified, the question is always whether my code is the problem or the application is simply not working correctly. For example, with KWin's Wayland session there is a segmentation fault at the end of its run time whenever, at some point during run time, an Xwayland based client did a flip on a sub-surface. But according to GDB the segmentation fault is caused by KWin's own egl context destruction and happens somewhere in the depths of the EGL DRI2 implementation in MESA.

I tried to find the reasons for some tricky issues like this one in the last few weeks, but at some point, for a few of them, I had to admit that either the tools I'm using aren't capable enough to point me in the right direction or I'm simply not knowledgeable enough to know where to look. I hope that after I have sent in the patches in the next few days I will get some helpful advice from senior developers who know the code better than I do, and that I can then solve the remaining problems.

The patches I will send to the mailing list in the next days are dependent on each other. As an overview, here is how I planned them:

  1. Introduce an internal API to the Xserver's Present implementation. The goal of this API is to make it easy to add flipping modes other than just full screen flips. While the previous code for full screen flips needs to be adapted to this API, this patch doesn't yet alter any functionality.
  2. Distribute more code onto several new files to improve readability.
  3. Add a window flipping mode to Present using the new internal API.
  4. Let Xwayland use the window flipping mode.

This partitioning will hopefully help the reviewers. The other advantage is that mistakes I may have made in one of these patches, and that were overlooked in the review process, might be easier to find afterwards. Splitting my code changes up into smaller units also gives me the opportunity to look at the structure of my code from a different perspective one more time and fix details I may have overlooked until now.

I hope to be able to send the patches in tomorrow. I'm not sure if I'm supposed to write one more post next week after the GSoC project has officially ended, but in any case I plan on writing one more post at some point in the future about the reaction to my patches and if and how they got merged in the end.

August 23, 2017
This is the last entry in the Google Summer of Code series that I have been writing weekly for the last three months. It is different from the usual updates in that I won’t be discussing development progress: rather, this will be the submission report for the project as a whole. I’ll be discussing the why? behind the project, the plan that my mentor and I came up with to execute the project, the work I have done over the summer including a video of the result, the things that are left to work on, what I’ve learned during the project and finally, the links to the code that I have written for the actual submission.
Well, this really is the last week of my Summer of Code! This will be a short update on the final changes made since my last update on Friday, which will be followed by a proper blog post for my final submission later this week. Last week I mentioned a few items that were work-in-progress. The search field for the button mapping dialog has been merged, including its revamp of the dialog’s UI.
August 21, 2017

This week I finished fixing regressions from the VC5 QPU instruction scheduler, and polished the vc5 series up so I could post it for review in Mesa. I landed a few initial bits that wouldn’t affect anyone else.

I also took another look at my DSI series. I had previously tried to work around the boot time dependency circle by letting the DSI device be created before the DSI host showed up. That proved to be fragile as well (in particular it would have had issues if the host had to -EPROBE_DEFER for some other reason), so I went back and made VC4 advertise a DSI host before any of the rest of the DSI encoder was ready. This proved to be not very hard, and I’m hoping once again that this is the final version of the series.

Other bits this week:

  • Fixed tile layout for 64bpp textures on VC5.
  • Fixed DSI transactions sleeping in the IRQ handler.
  • Did some investigation of HW behavior for upstreaming Pi3 BT.
  • Sent a second set of PRs for bcm2835 features in 4.14.
  • Landed NEON build support for ARMv6 and Jonas's ARM64 port.
  • Landed HDMI EDID leak fix.
  • Landed X Server scissoring optimizations for VC4.
  • Fixed and resubmitted GL_OES_required_internalformat support (part of vc4 GLES2 DEQP fixes).
August 18, 2017

The last week of GSoC 2017 is about to begin. My project is in a pretty good state, I would say: I have created a comprehensive solution for Xwayland Present support, which is integrated firmly and not just attached to the main code path like an afterthought. But there are still some issues to sort out, especially the correct cleanup of objects, which is difficult. That's only a problem with sub-surfaces, though. So if I'm not able to solve these issues in the next few days, I'll just allow full window flips. This would still cover all full screen windows, including for example the Steam client, as it renders its full windows directly without involving the compositor.

I still hope to solve the last problems with the sub-surfaces, though, since this would mean that in all direct rendering cases on Wayland we could use buffer flips, which would be a huge improvement compared to native X.

In any case, I'll first send the final patch for the Present extension to the xorg-devel mailing list. This patch will add a separate mode for per-window flips to the extension code. After that I'll send the patches for Xwayland, either with or without sub-surface support.

That's all for now as a last update before the end, with a big decision still to be made. In one week I can report back on what I chose and what the final code looks like.

This week I have been solving more issues to make sure that Piper offers a pleasant user experience, doesn't crash and runs smoothly. My mentor and cvuchener from the libratbag project have been testing Piper over the last week, and together with a handful of users attracted to Piper they have opened a bunch of issues for me to solve. Let's run through the most visible ones! Solving global warming Probably the most fun issue to resolve this week was the one reported by my mentor: Piper contributes to global warming (you won't believe what happened next!
August 16, 2017
The indoor garden at Collège de Maisonneuve, the DebConf 17 venue
Decorative photo of the indoor garden

I'm currently at DebConf 17 in Montréal, back at DebConf for the first time in 10 years (last time was DebConf 7 in Edinburgh). It's great to put names to faces and meet more of my co-developers in person!

On Monday I gave a talk entitled “A Debian maintainer's guide to Flatpak”, aiming to introduce Debian developers to Flatpak, and show how Flatpak and Debian (and Debian derivatives like SteamOS) can help each other. It seems to have been quite well received, with people generally positive about the idea of using Flatpak to deliver backports and faster-moving leaf packages (games!) onto the stable base platform that Debian is so good at providing.

A video of the talk is available from the Debian Meetings Archive. I've also put up my slides in the DebConf git-annex repository, with some small edits to link to more source code: A Debian maintainer's guide to Flatpak. Source code for the slides is also available from Collabora's git server.

The next step is to take my proof-of-concept for building Flatpak runtimes and apps from Debian and SteamOS packages, flatdeb, get it a bit more production-ready, and perhaps start publishing some sample runtimes from a cron job on a Debian or Collabora server. (By the way, if you downloaded that source right after my talk, please update - I've now pushed some late changes that were necessary to fix the 3D drivers for my OpenArena demo.)

I don't think Debian will be going quite as far as Endless any time soon: as Cosimo outlined in the talk right before mine, they deploy their Debian derivative as an immutable base OS with libOSTree, with all the user-installable modules above that coming from Flatpak. That model is certainly an interesting thing to think about for Debian derivatives, though: at Collabora we work on a lot of appliance-like embedded Debian derivatives, with a lot of flexibility during development but very limited state on deployed systems, and Endless' approach seems a perfect fit for those situations.

[Edited 2017-08-16 to fix the link for the slides, and add links for the video]

August 14, 2017
I recently acquired an r7 360 (BONAIRE) and spent some time getting radv stable and passing the same set of conformance tests that VI and Polaris pass.

The main missing thing was 10-bit integer format clamping, working around a bug in the SI/CIK fragment shader output hardware where it truncates instead of clamping. The other missing piece was code for handling f16->f32 conversions according to the Vulkan spec, which I'd previously fixed for VI.

I also looked at a trace from amdgpu-pro and noticed it was using ds_swizzle for the derivative calculations, which avoids accessing LDS memory. I added support for this path to radv/radeonsi, since LLVM has supported the intrinsic for a while now.

With these fixed CIK is pretty much in the same place as VI/Polaris.

I then plugged in my SI (Tahiti) and got lots of GPU hangs and crashes. I fixed a number of SI specific bugs (tiling and MSAA handling, stencil tiling). However, even with those fixed I was getting random hangs, and a bunch of people on a bugzilla had noticed the same thing. I eventually discovered that adding a shader pipeline and cache flush at the end of every command buffer fixed the hangs (narrowing this down exactly took a few days). We aren't 100% sure why this is required on SI only, it may be a kernel bug or a command processor bug, but it does mean radv on SI can now run games without hanging.

There are still a few CTS tests outstanding on SI only, and I'll probably get to them eventually, however I also got an RX Vega and once I get a newer BIOS for it from AMD I shall be spending some time fixing the radv support for it.

This week was spent almost entirely on the VC5 QPU instruction scheduler. This is what packs together the ADD and MUL instruction components and the signal flags (LDUNIF, LDVPM, LDTMU, THRSW, etc.) into the final sequence of 64-bit instructions.

I based it on the VC4 scheduler, which in turn is based on i965’s. Being the 5th of these I’ve worked on, it would sure be nice if it felt like less copy and paste, but it’s almost all very machine-dependent code and I haven’t come up with a way to reduce the duplication.

The initial results are great:

instructions in affected programs:     116269 -> 71677 (-38.35%)

but I’ve still got a handful of regressions to fix.

Next up for scheduling is to fill the thrsw and branch delay slots. I also need to pack more than 2 things into one QPU instruction – right now we pick two of ADD, MUL, LDUNIF, and LDVPM, but we could do an ADD, MUL, LDVPM, and LDUNIF all together. That should be a small change from here.

Other bits this week:

  • Fixed a leak of the EDID data from the HDMI connector (~256 bytes per xrandr -q invocation).
  • Reviewed modifier blob support from daniels for better atomic-modesetting weston on vc4.
  • Pushed the RCL walk order DRM support.
  • Reworked patches for turning on vc4 NEON support in Raspbian builds, to fix Android builds.
August 11, 2017

One more time I decided to start from the beginning and try another, even more radical approach to my Xwayland GSoC project than last time. I have now basically written a full API inside the Present extension with which modes of presentation can be added. There are of course only two modes right now: the default screen presenting mode, which works as it did until now, and the new one for Xwayland, which presents on individual windows without the need for them to be full screen. While this was also possible with the version from last week, the code is now substantially better structured.

I'm currently still in a phase of testing, so I won't write much more about it for now. Instead I want to talk about one very persistent bug, which popped up seemingly from one day to the next in my KWin session and hindered my work immensely.

This is also a tale of how difficult it can be to find the exact reason for a bug, especially when there is not much information to work with. As said, the problem came out of nowhere. I had used Neverball to test my Xwayland code in the past all the time. But starting a few days ago, whenever I selected a level and the camera was about to pan down to the level, the whole KWin session blocked and I could only hard reboot the computer or SIGKILL the KWin process via SSH. The image of the level was frozen and keyboard input didn't work anymore. That said, I could still hear the Neverball music playing in the background, so the application itself wasn't the problem. And Xwayland and KWin didn't quit with a segfault, they just stopped doing stuff.

So I began the search. Of course I first suspected my own code to be the problem, but when I tried the Xwayland master branch I experienced the same problem. But what was going on? Why did Neverball suddenly not work at all anymore? I had used it all the time in the past, and now everything blocks? So I first tried rolling back commits from the last few weeks in the master branches of Xwayland, KWin and KWayland, thinking that the problem must have been introduced at that point in time, because I could still use Neverball just recently without problems, right?

But the problem was still there. So I went further back. It worked with my distribution's Xwayland, and when manually testing through the input related commits to Xwayland, starting from somewhere at the beginning of the year, I finally found the responsible commit, or so I thought. Before that commit no blocking, with it blocking, so there had to be an error with this particular commit, right? But on the other hand, why could I use Neverball just one week ago without blocking when this commit is several months old?

Nevertheless I studied the Xwayland input code thoroughly. The documentation for this stuff is non-existent and the function patterns confusing, but with time I understood it well enough to decide that this couldn't be the cause of the problem. Another indicator was that Weston worked. The problem had to be somewhere in KWin or KWayland. After lots of time I also finally understood, somewhat, why I could still use Neverball a few days ago but now not at all anymore: back then I always started KWin from the terminal without launching a full Wayland Plasma session. After everything worked fine there, I switched to testing in the Plasma session and apparently missed that from then on the problem existed. So was it Plasma itself? But that wasn't really possible, since Plasma is a different process.

I didn't want to give up, and so I looked through the KWayland and KWin code related to pointer locking and confinement, which is a lot. Hours later I finally found the root cause: KWin creates small on-screen notifications when a pointer is locked or confined to a window. Most of the time this works without problems, but with the above patch to Xwayland the client sends the pointer confine and lock requests to KWin in quick succession, and for some reason, when trying to show both notifications at the same time, KWin, or maybe the QML engine for the notification, can't proceed any further. Without the patch Xwayland only ever sent the confinement request, and nothing blocked. I don't know how Martin would like to have this issue solved, so I created a bug report for now. It's weird that such a petty cause had such huge consequences in the end, but that's how it goes.

Last week I shared the news that all large features had been implemented; all that was left were issues raised by GNOME contributors when my mentor demoed my progress over GUADEC. This week I’ve just been chugging away on those issues; let me show you. Obvious save button is.. not obvious? Hadess noted that the save button wasn’t obvious when my mentor demoed Piper to him; I already mentioned this last week: the icon is a save to disk icon and it’s not obvious at first glance that you have to press it to write the changes you made to the device.
August 09, 2017

The All Systems Go! 2017 Headline Speakers Announced!

Don't forget to send in your submissions to the All Systems Go! 2017 CfP! Proposals are accepted until September 3rd!

A couple of headline speakers have been announced now:

  • Alban Crequy (Kinvolk)
  • Brian "Redbeard" Harrington (CoreOS)
  • Gianluca Borello (Sysdig)
  • Jon Boulle (NStack/CoreOS)
  • Martin Pitt (Debian)
  • Thomas Graf (covalent.io/Cilium)
  • Vincent Batts (Red Hat/OCI)
  • (and yours truly)

These folks will also review your submissions as part of the papers committee!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

August 08, 2017

A while back at the awesome maintainerati I chatted with a few great fellow maintainers about how to scale really big open source projects, and how github forces projects into a certain way of scaling. The linux kernel has an entirely different model, which maintainers hosting their projects on github don’t understand, and I think it’s worth explaining why and how it works, and how it’s different.

Another motivation to finally get around to typing this all up is the HN discussion on my “Maintainers Don’t Scale” talk, where the top comment boils down to “… why don’t these dinosaurs use modern dev tooling?”. A few top kernel maintainers vigorously defend mailing lists and patch submissions over something like github pull requests, but at least some folks from the graphics subsystem would love more modern tooling which would be much easier to script. The problem is that github doesn’t support the way the linux kernel scales out to a huge number of contributors, and therefore we can’t simply move, not even just a few subsystems. And this isn’t about just hosting the git data, that part obviously works, but about how pull requests, issues and forks work on github.

Scaling, the Github Way

Git is awesome, because everyone can fork and create branches and hack on the code very easily. And eventually you have something good, and you create a pull request for the main repo and get it reviewed, tested and merged. And github is awesome, because it figured out a UI that makes this complex stuff all nice&easy to discover and learn about, and so makes it a lot simpler for new folks to contribute to a project.

But eventually a project becomes a massive success, and no amount of tagging, labelling, sorting, bot-herding and automating will be able to keep on top of all the pull requests and issues in a repository, and it’s time to split things up into more manageable pieces again. More importantly, with a certain size and age of a project, different parts need different rules and processes: the shiny new experimental library has different stability and CI criteria than the main code, and maybe you have some dumpster pile of deprecated plugins that aren’t supported anymore, but which you can’t yet delete: you need to split up your humongous project into sub-projects, each with their own flavour of process and merge criteria and their own repo with their own pull request and issue tracking. Generally it takes a few tens to a few hundred full-time contributors until the pain is big enough that such a huge reorganization is necessary.

Almost all projects hosted on github do this by splitting up their monorepo source tree into lots of different projects, each with its distinct set of functionality. Usually that results in a bunch of things that are considered the core, plus piles of plugins and libraries and extensions. All tied together with some kind of plugin or package manager, which in some cases directly fetches stuff from github repos.

Since almost every big project works like this I don’t think it’s necessary to dwell on the benefits. But I’d like to highlight some of the issues this is causing:

  • Your community fragments more than necessary. Most contributors just have the code and repos around that they directly contribute to, and ignore everything else. That’s great for them, but makes it much less likely that duplicated effort and parallel solutions between different plugins and libraries get noticed and the efforts shared. And people who want to steward the overall community need to deal with the hassle of tons of repos either managed through some script, or git submodules, or something worse, plus they get drowned in pull requests and issues by being subscribed to everything. Any kind of concern (maybe you have shared build tooling, or documentation, or whatever) that doesn’t neatly align with your repo splits but cuts across the project becomes painful for the maintainers responsible for that.

  • Even once you’ve noticed the need for it, refactoring and code sharing have more bureaucratic hurdles: First you have to release a new version of the core library, then go through all the plugins and update them, and then maybe you can remove the old code in the shared library. But since everything is massively spread around you can forget about that last step.

    Of course it’s not much work to do this, and many projects excel at making this fairly easy. But it is more effort than a simple pull request to the one single monorepo. Very simple refactorings (like just sharing a single new function) will happen less often, and over a long time that compounds and accumulates a lot of debt. Except when you go the node.js way with repos for single functions, but then you essentially replace git with npm as your source control system, and that seems somewhat silly too.

  • The combinatorial explosion of theoretically supported version mixes becomes unsupportable. As a user that means you end up having to do the integration testing. As a project you’ll end up with blessed combinations, or at least de-facto blessed combinations, because developers just close bug reports with “please upgrade everything first”. Again that means you de facto have a monorepo, except once more it’s not managed in git. Well, except if you use submodules, and I’m not sure that’s considered git …

  • Reorganizing how you split the overall projects into sub-projects is a pain, since it means you need to reorganize your git repositories and how they’re split up. In a monorepo shifting around maintainership just amounts to updating OWNER or MAINTAINERS files, and if your bots are all good the new maintainers get auto-tagged automatically. But if your way of scaling means splitting git repos into disjoint sets, then any reorg is as painful as the initial step from a monorepo to a group of split up repositories. That means your project will be stuck with a bad organizational structure for too long.

Interlude: Why Pull Requests Exist

The linux kernel is one of the few projects I’m aware of which isn’t split up like this. Before we look at how that works - the kernel is a huge project and simply can’t be run without some sub-project structure - I think it’s interesting to look at why git does pull requests: on github pull requests are the one true way for contributors to get their changes merged. But in the kernel changes are submitted as patches sent to mailing lists, even long after git has been widely adopted.

But the very first version of git supported pull requests. The audience of these first, rather rough, releases was kernel maintainers, git was written to solve Linus Torvalds’ maintainer problems. Clearly it was needed and useful, but not to handle changes from individual contributors: Even today, and much more back then, pull requests are used to forward the changes of an entire subsystem, or synchronize code refactoring or similar cross-cutting change across different sub-projects. As an example, the 4.12 network pull request from Dave S. Miller, committed by Linus: It contains 2k+ commits from 600 contributors and a bunch of merges for pull requests from subordinate maintainers. But almost all the patches themselves are committed by maintainers after picking up the patches from mailing lists, not by the authors themselves. This kernel process peculiarity that authors generally don’t commit into shared repositories is also why git tracks the committer and author separately.

Github’s innovation and improvement was then to use pull requests for everything, down to individual contributions. But that wasn’t what they were originally created for.

Scaling, the Linux Kernel Way

At first glance the kernel looks like a monorepo, with everything smashed into one place in Linus’ main repo. But that’s very far from it:

  • Almost no one who’s using linux is running the main repo from Linus Torvalds. If they run something upstream-ish it’s probably one of the stable kernels. But much more likely is that they run a kernel from their distro, which usually has additional patches and backports, and isn’t even hosted on kernel.org, so would be a completely different organization. Or they have a kernel from their hardware vendor (for SoC and pretty much anything Android), which often have considerable deltas compared to anything hosted in one of the “main” repositories.

  • No one (except Linus himself) is developing stuff on top of Linus’ repository. Every subsystem, and often even big drivers, have their own git repositories, with their own mailing lists to track submissions and discuss issues completely separate from everyone else.

  • Cross-subsystem work is done on top of the linux-next integration tree, which contains a few hundred git branches from about as many different git repositories.

  • All this madness is managed through the MAINTAINERS file and the get_maintainer.pl script, which for any given snippet of code can tell you who’s the maintainer, who should review this, where the right git repo is, which mailing lists to use and how and where to report bugs (see the example below). And it’s not just strictly based on file locations, it also catches code patterns to make sure that cross-subsystem topics like device-tree handling, or the kobject hierarchy, are handled by the right experts.
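
As a concrete illustration of the get_maintainer.pl step (the file path is just an example):

# From a kernel tree: who maintains this file, and which lists review it?
./scripts/get_maintainer.pl -f drivers/gpu/drm/drm_atomic.c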

At first this just looks like a complicated way to fill everyone’s disk space with lots of stuff they don’t care about, but there’s a pile of compounding minor benefits that add up:

  • It’s dead easy to reorganize how you split things into sub-projects: just update the MAINTAINERS file and you’re done. It’s a bit more work than it really needs to be, since you might need to create a new repo, new mailing lists and a new bugzilla. That’s just a UI problem that github solved with this neat little fork button.

  • It’s really, really easy to reassign discussions on pull requests and issues between sub-projects: you simply adjust the Cc: list on your reply. Similarly, doing cross-subsystem work is much easier to coordinate, since the same pull request can be submitted to multiple sub-projects, and there’s just one overall discussion (since the Msg-Ids: tags used for mailing list threading are the same for everyone), even though the mails are archived in a pile of different mailing list archives, go through different mailing lists and land in a few thousand different inboxes. Making it easier to discuss topics and code across sub-projects avoids fragmentation and makes it much easier to spot where code sharing and refactoring would be beneficial.

  • Cross-subsystem work doesn’t need any kind of release dance. You simply change the code, which is all in your one single repository. Note that this is strictly more powerful than what a split repo setup allows you: for really invasive refactorings you can still space out the work over multiple releases, e.g. when there are so many users that you can’t change them all at once without causing too much coordination pain.

    A huge benefit of making refactoring and code sharing easier is that you don’t have to carry around so much legacy gunk. That’s explained at length in the kernel’s no stable api nonsense document.

  • It doesn’t prevent you from creating your own experimental additions, which is one of the key benefits of the multi-repo setup. Add your code in your own fork and leave it at that - no one ever forces you to push the code back, or push it into the one single repo or even to the main organization, because there simply is no central repository. This works really well, maybe too well, as evidenced by the millions of code lines which are out-of-tree in the various Android hardware vendor repositories.

In short, I think this is a strictly more powerful model, since you can always fall back to doing things exactly like you would with multiple disjoint repositories. Heck, there are even kernel drivers which are in their own repository, disjoint from the main kernel tree, like the proprietary Nvidia driver. Well, it’s just a bit of source code glue around a blob, but since it can’t contain anything from the kernel for legal reasons, it is the perfect example.

This looks like a monorepo horror show!

Yes and no.

At first glance the linux kernel looks like a monorepo because it contains everything. And lots of people learned that monorepos are really painful, because past a certain size they just stop scaling.

But looking closer, it’s very, very far away from a single git repository. Just looking at the upstream subsystem and driver repositories gives you a few hundred. If you look at the entire ecosystem, including hardware vendors, distributions, other linux-based OS and individual products, you easily have a few thousand major repositories, and many, many more in total. Not counting any git repo that’s just for private use by individual contributors.

The crucial distinction is that linux has one single file hierarchy as the shared namespace across everything, but lots and lots of different repos for all the different pieces and concerns. It’s a monotree with multiple repositories, not a monorepo.

Examples, please!

Before I go into explaining why github cannot currently support this workflow, at least if you want to retain the benefits of the github UI and integration, we need some examples of how this works in practice. The short summary is that it’s all done with git pull requests between maintainers.

The simple case is percolating changes up the maintainer hierarchy, until it eventually lands in a tree somewhere that is shipped. This is easy, because the pull request only ever goes from one repository to the next, and so could be done already using the current github UI.

Much more fun are cross-subsystem changes, because then the pull request flow stops being an acyclic graph and morphs into a mesh. The first step is to get the changes reviewed and tested by all the involved subsystems and their maintainers. In the github flow this would be a pull request submitted to multiple repositories simultaneously, with the one single discussion stream shared among them all. Since this is the kernel, this step is done through patch submission with a pile of different mailing lists and maintainers as recipients.

The way it’s reviewed is usually not the way it’s merged, instead one of the subsystems is selected as the leading one and takes the pull requests, as long as all other maintainers agree to that merge path. Usually it’s the subsystem most affected by a set of changes, but sometimes also the one that already has some other work in-flight which conflicts with the pull request. Sometimes also an entirely new repository and maintainer crew is created, this often happens for functionality which spans the entire tree and isn’t neatly contained to a few files and directories in one place. A recent example is the DMA mapping tree, which tries to consolidate work that thus far has been spread across drivers, platform maintainers and architecture support groups.

But sometimes there’s multiple subsystems which would both conflict with a set of changes, and which would all need to resolve some non-trivial merge conflict. In that case the patches aren’t just directly applied (a rebasing pull request on github), but instead the pull request with just the necessary patches, based on a commit common to all subsystems, is merged into all subsystem trees. The common baseline is important to avoid polluting a subsystem tree with unrelated changes. Since the pull is for a specific topic only, these branches are commonly called topic branches.

One example I was involved with added code for audio-over-HDMI support, which spanned both the graphics and sound driver subsystems. The same commits from the same pull request were merged into both the Intel graphics driver tree and the sound subsystem tree.

An entirely different example showing that this isn’t insane: the only other relevant general-purpose large-scale OS project in the world also decided to have a monotree, with a commit flow modelled similarly to what’s going on in linux. I’m talking about the folks with such a huge tree that they had to write an entire new GVFS virtual filesystem provider to support it …

Dear Github

Unfortunately github doesn’t support this workflow, at least not natively in the github UI. It can of course be done with just plain git tooling, but then you’re back to patches on mailing lists and pull requests over email, applied manually. In my opinion that’s the main reason why the kernel community cannot benefit from moving to github. There’s also the minor issue of a few top maintainers being extremely outspoken against github in general, but that’s not really a technical issue. And it’s not just the linux kernel, it’s all huge projects on github in general which struggle with scaling, because github doesn’t really give them the option to scale to multiple repositories while sticking with a monotree.

In short, I have one simple feature request to github:

Please support pull requests and issue tracking spanning different repos of a monotree.

Simple idea, huge implications.

Repositories and Organizations

First, it needs to be possible to have multiple forks of the same repo in one organization. Just look at git.kernel.org, most of these repositories are not personal. And even if you might have different organizations for e.g. different subsystems, requiring an organization for each repo is a silly amount of overkill and just makes access and user management unnecessarily painful. In graphics for example we’d have 1 repo each for the userspace test suite, the shared userspace library, and a common set of tools and scripts used by maintainers and developers, which would work on github. But then we’d have the overall subsystem repo, plus a repository for core subsystem work and additional repositories for each big driver. Those would all be forks, which github doesn’t do. And each of these repos has a bunch of branches, at least one for feature work, and another one for bugfixes for the current release cycle.

Combining all branches into one repository wouldn’t do, since the point of splitting repos is that pull requests and issues are separated, too.

Related: it needs to be possible to establish the fork relationship after the fact. For new projects which have always been on github this isn’t a big deal, but linux will be able to move at most a subsystem at a time, and there are already tons of linux repositories on github which aren’t proper github forks of one another.

Pull Requests

Pull requests need to be attached to multiple repos at the same time, while keeping one unified discussion stream. You can already reassign a pull request to a different branch of a repo, but not to multiple repositories at the same time. Reassigning pull requests is really important, since new contributors will just create pull requests against what they think is the main repo. Bots can then shuffle those around to all the repos listed in e.g. a MAINTAINERS file for a given set of files and changes a pull request contains. When I chatted with githubbers I originally suggested they’d implement this directly. But I think as long as it’s all scriptable that’s better left to individual projects, since there’s no real standard.

There’s a pretty funky UI challenge here, since the patch list might be different depending upon the branch the pull request is against. But that’s not always a user error; one repo might simply have merged a few patches already.

Also, the pull request status needs to be different for each repo. One maintainer might close it without merging, since they agreed that the other subsystem will pull it in, while the other maintainer will merge and close the pull. Another tree might even close the pull request as invalid, since it doesn’t apply to that older version or vendor fork. Even more fun, a pull request might get merged multiple times, in each subsystem with a different merge commit.

Issues

Like pull requests, issues can be relevant for multiple repos, and might need to be moved around. An example would be a bug that’s first reported against a distribution’s kernel repository. After triage it’s clear it’s a driver bug still present in the latest development branch and hence also relevant for that repo, plus the main upstream branch and maybe a few more.

Status should again be separate, since a push to one repo doesn’t make the bugfix instantly available in all of them. It might even need additional work to get backported to older kernels or distributions, and some might even decide that’s not worth it and close it as WONTFIX, even though it’s marked as successfully resolved in the relevant subsystem repository.

Summary: Monotree, not Monorepo

The Linux Kernel is not going to move to github. But bringing the Linux way of scaling, with a monotree but multiple repos, to github as a concept will be really beneficial for all the huge projects already there: it’ll give them a new, and in my opinion more powerful, way to handle their unique challenges.

August 07, 2017

The biggest news this week is that I’ve resolved the window movement performance on Raspbian. After all my work on improving vc4 performance for overlapping CopyArea to parity with SW rendering, the end-user experience was still awful. It turned out that openbox just had a really naive event loop for window movement, so each pixel of motion of the mouse that got reported resulted in a window movement, even if there were 100 of those queued to the window manager in a stream.

I first fixed openbox to consume all the motion events it could and then set the window to the last position it knew of. This didn’t quite work, because openbox was still only reading one motion event in at a time, and queuing up a reposition in between each one.

Next, I made openbox do a round-trip to the server after each copy, so that we get all the motion collected at once per copy that happens on the server side. Now, window dragging keeps up with the mouse as fast as you can wiggle it.
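
The same event-compression idea can be sketched in a few lines of Xlib. This is just an illustration of the approach, not openbox’s actual code:

/* A sketch of motion-event compression; not openbox's actual code. */
#include <X11/Xlib.h>

static void
drag_step(Display *dpy, Window win)
{
    XEvent ev, last;
    Bool have_motion = False;

    /* Drain every queued MotionNotify for this window, keeping only the last. */
    while (XCheckTypedWindowEvent(dpy, win, MotionNotify, &ev)) {
        last = ev;
        have_motion = True;
    }

    if (have_motion) {
        /* One reposition for the whole batch (coordinate bookkeeping omitted). */
        XMoveWindow(dpy, win, last.xmotion.x_root, last.xmotion.y_root);
        /* Round-trip so the server finishes the copy before we read more motion. */
        XSync(dpy, False);
    }
}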

I also worked on backporting the NEON assembly for texture upload/download to the ARMv6 build. Jonas Pfeil started on this, but I want to rework the detection logic a bit to match what Mesa does for other asm support. That should land next week, so that we can have NEON on Raspbian.

On the VC5 front, the next big step is QPU instruction scheduling. So far I’ve removed a duplicate list of QPU instructions, and spent time thinking about how I want to structure the input to the scheduler (it would be good to have instructions reading uniforms be able to be reordered, but we can’t do that if I have the ldunif signal separate from the usage like I do today).

Finally, when I was reviewing daniels’s X11 Present modifiers protocol, I complained that he was representing 64-bit values as pairs of 32-bit values, when we (finally!) recently started using 64-bit values on the wire and in our code. He replied that he had tried that, but the SYNC extension was #defining CARD64 to an XSyncValue and there were conflicts over CARD64. CARD* is X11’s custom not-quite-stdint.h types for sized unsigned values, and XSyncValue is a struct with 2 32-bit values to represent a signed 64-bit value.

My solution was to just remove the CARD64 usage in the SYNC extension, packing and unpacking int64_ts from the wire protocol and then using normal 64-bit math. On 32-bit processors this should expand to the same code as before, but on 64-bit we will probably get actual 64-bit math that we didn’t before (unless your compiler is pretty clever). It also cleaned up some gross code for making XSyncValues of small values like “4” and “0”.
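
For illustration, packing and unpacking a signed 64-bit value through hi/lo 32-bit wire fields looks roughly like this (a sketch, not the actual xserver patch):

#include <stdint.h>

/* A sketch of the hi/lo packing; not the actual xserver patch. */
static inline int64_t
unpack_sync_value(uint32_t hi, uint32_t lo)
{
    return ((int64_t)hi << 32) | lo;
}

static inline void
pack_sync_value(int64_t value, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)((uint64_t)value >> 32);
    *lo = (uint32_t)value;
}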

However, since I modified the code, I wanted to run the tests, and I found no tests. Now that we have the meson build system and simple-xinit, it was trivial to write some XCB touch-testing of the protocol and verify that my changes at least didn’t regress the protocol I’ve tested. Hopefully these tests can be a model for future automated protocol testing under “make check”. I didn’t port the test to automake, because automake makes testing awful.

Also this week:

  • Fixed a kernel crashing regression when vc4 fails to load completely since the BO labeling debug work.
  • Started merging a bunch of previous vc4 kernel patches thanks to review from Boris
  • Wrote a piglit test for the overlapping copy extension
  • Fixed up xserver’s Travis CI to run tests that start an X Server.
August 04, 2017

This week I reworked huge parts of my code and I have a feeling that I’m on the right track. I wrote a second mail to the xorg-devel mailing list and the feedback I got back was also way more positive than on the first try. Thanks again to Michel Dänzer for taking the time for this.

In the mail I described my current approach in detail. Basically, I created a second code path in the Present DIX for DDXes with the ability to flip windows individually, as Xwayland is capable of doing. In this code path I also changed the Present extension so that it will only ever signal the client that a certain used pixmap can be reused once the DDX has also freed it. In the Xwayland case this happens when the Wayland buffer associated with the pixmap has been released by the Wayland server.

What’s still somewhat fragile, but one of my personal darlings, is flipping the pixmaps of child windows inside another window’s geometry. I had the idea to use Wayland sub-surfaces for that, and at least on KWin it worked quite nicely. I like it, because it reaches deeply into the X rendering pipeline, being a combination of very different internal objects of X and Wayland which fit together in an amazingly smooth way.

As said though the sub-surface part is still somewhat fragile. I often had problems with dangling pointers because of all the different objects in play, in particular with the RRCrtcs provided by the DDX. Of course you can always go back and guard the code against it when you’ve encountered such a situation while testing. But the risk of missing edge cases is high.

For example, everything seemed to work flawlessly at some point a few days ago with my two main test applications, VLC and Neverball, but when I tried the Steam desktop client I got segfaults all the time when opening or closing the small drop down menus in the upper half of the window. I came to the conclusion that this happened because the drop down menu windows were reparented in between: the old parent was destroyed, and the child window was suddenly left with a dangling pointer to the RRCrtc of its former parent. And yeah, I fixed this problem with some guards, but there might be more of these issues I just haven’t encountered yet, right?

What I’m considering instead right now, in order to improve the situation in a more general sense, is making the CRTC and window structs independent. Currently we make the implicit assumption that one window corresponds over its lifetime to exactly one RRCrtc. In Xwayland I created just one fake CRTC per window for that. Conceptually this makes sense, because we want to flip the windows individually and not only flip display front buffers as represented by their CRTCs, but the crux is that Present might try to access a CRTC object after the window has been deleted long ago. Making the objects independent again, we could directly test in Xwayland whether an RRCrtc is still available as an xwl_output in the output list of the xwl_screen struct, just the way it’s normally done.

Making CRTCs and windows independent would also bring back the Present code paths to being more similar again from a structural point of view, but it still might mean that the two code paths need to be dissociated more strongly. I’ll see this weekend how it works out and will report on my findings in the article next week.

When I proposed the project with my mentor, we worked out a bunch of features that the new Piper should have. These are listed in the Redesign Wiki, but here’s a high-level summary:

  • A welcome screen, presenting a list of connected and supported devices.
  • An error screen, presenting any problems in a user-friendly manner.
  • The main screen, presenting the configuration pages:
    • A page to configure resolutions (see part 7)
    • A page to configure button mappings (see parts 9 and 10)
    • A page to configure LEDs (see part 8)
  • Support for device profiles (see part 9 for the start on this).
August 01, 2017

This week I worked on building an implementation of the overlapping blit extension for VC4. My first extension spec effectively said “unscaled overlapping BlitFramebuffer now gives you the answer you obviously want.” I had plumbed it all the way through to the backend, written spec text, and was getting ready to submit it for review, when I realized it wasn’t the spec I wanted.

The problem for X11, and specifically Raspbian’s desktop, is that we’re doing a CopyArea with a non-rectangular region. If we implement overlapping blits using BlitFramebuffer, we have a blit per rectangle in the region, each of which dispatches a rendering job to the kernel. Since rendering jobs only operate on tiles, those single-scanline rectangles at the top and bottom of your rounded window have 64x the cost they should, not even counting job dispatch overhead.

So, I went back to the drawing board and wrote a new spec. What I’m offering now is control over the basic HW functionality: You can flip a bit that says “Instead of tile raster order being HW-defined, you have two bits for left-to-right or right-to-left, and top-to-bottom or bottom-to-top.” Combined with GL_ARB_texture_barrier (where I provide the definition for what guarantees the tile raster order provides you), we can submit drawing for all the rectangles in the copy in a single job.
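
As a rough illustration of the API shape, here is how taking control of the tile order might look from GL, using the enable caps from the MESA_tile_raster_order spec as it eventually shipped in Mesa (treat the exact names as the spec’s, not this draft’s):

/* A sketch of requesting a defined tile raster order for an overlapping
 * blit; the enums are the ones from the MESA_tile_raster_order spec. */
glEnable(GL_TILE_RASTER_ORDER_FIXED_MESA);         /* take control of tile order */
glEnable(GL_TILE_RASTER_ORDER_INCREASING_X_MESA);  /* left-to-right */
glDisable(GL_TILE_RASTER_ORDER_INCREASING_Y_MESA); /* decreasing Y */
glBlitFramebuffer(...);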

So far, review in Mesa is going well, though the kernel side has been silent so far. Jason Ekstrand at Intel noted that while they can’t do tile raster order, they could do overlapping 1:1 blits in most cases, and could break things down in the others to a series of valid operations with cache flushes in between. However, his description of the breakdown process is already something that Glamor could do using GL_ARB_texture_barrier:

while (dest_region is nonempty) {
    src_region = dest_region + src copy offset
    this_region = dest_region - src_region   /* the part safe to copy now */
    copy this_region
    glTextureBarrier()
    dest_region -= this_region
}

While I don’t need to do this for VC4, since we have a better option with my extension, it still seems like a good thing to implement.

For VC4, I have x11perf -copywinwin500 performing the same as the previous SW rasterization that Raspbian had. Next week: fixing the interactivity now that the raw performance is in place.

Other news tidbits this week:

  • Landed the CLIF-style debug dumping for VC4
  • Landed the bank-aware register selection optimization for VC4, at last.
  • Debugged an assertion failure regression on VC4 from Marek’s CPU optimization work
  • Tested and landed Hans Verkuil’s HDMI CEC driver for VC4.
  • Landed kernel-side BO labeling (debugfs memory allocation descriptions) support for VC4
  • Did some review on the DRI3 plane modifier patch series from Collabora (a great start, but underspecified in some critical ways).
July 30, 2017

In the previous post we talked about the Phong lighting model as a means to represent light in a scene. Once we have light, we can think about implementing shadows, which are the parts of the scene that are not directly exposed to light sources. Shadow mapping is a well known technique used to render shadows in a scene from one or multiple light sources. In this post we will start discussing how to implement this, specifically, how to render the shadow map image, and the next post will cover how to use the shadow map to render shadows in the scene.

Note: although the code samples in this post are for Vulkan, it should be easy for the reader to replicate the implementation in OpenGL. Also, my OpenGL terrain renderer demo implements shadow mapping and can also be used as a source code reference for OpenGL.

Algorithm overview

Shadow mapping involves two passes. The first pass renders the scene from the point of view of the light with depth testing enabled and records depth information for each fragment. The resulting depth image (the shadow map) contains depth information for the fragments that are visible from the light source, and therefore are occluders for any other fragment behind them from the point of view of the light. In other words, these represent the only fragments in the scene that receive direct light; every other fragment is in the shade. In the second pass we render the scene normally to the render target from the point of view of the camera, and then for each fragment we compute the distance to the light source and compare it against the depth information recorded in the previous pass to decide if the fragment is behind a light occluder or not. If it is, then we remove the diffuse and specular components for the fragment, making it look shadowed.

In this post I will cover the first pass: generation of the shadow map.

Producing the shadow map image

Note: those looking for OpenGL code can have a look at this file ter-shadow-renderer.cpp from my OpenGL terrain renderer demo, which contains the shadow map renderer that generates the shadow map for the sun light in that demo.

Creating a depth image suitable for shadow mapping

The shadow map is a regular depth image where we will record depth information for fragments in light space. This image will be rendered into and sampled from. In Vulkan we can create it like this:

...
VkImageCreateInfo image_info = {};
image_info.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
image_info.pNext = NULL;
image_info.imageType = VK_IMAGE_TYPE_2D;
image_info.format = VK_FORMAT_D32_SFLOAT;
image_info.extent.width = SHADOW_MAP_WIDTH;
image_info.extent.height = SHADOW_MAP_HEIGHT;
image_info.extent.depth = 1;
image_info.mipLevels = 1;
image_info.arrayLayers = 1;
image_info.samples = VK_SAMPLE_COUNT_1_BIT;
image_info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
image_info.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_SAMPLED_BIT;
image_info.queueFamilyIndexCount = 0;
image_info.pQueueFamilyIndices = NULL;
image_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
image_info.flags = 0;

VkImage image;
vkCreateImage(device, &image_info, NULL, &image);
...

The code above creates a 2D image with a 32-bit float depth format. The shadow map’s width and height determine the resolution of the depth image: larger sizes produce higher quality shadows but of course this comes with an additional computing cost, so you will probably need to balance quality and performance for your particular target. In the first pass of the algorithm we need to render to this depth image, so we include the VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT usage flag, while in the second pass we will sample the shadow map from the fragment shader to decide if each fragment is in the shade or not, so we also include the VK_IMAGE_USAGE_SAMPLED_BIT.

One more tip: when we allocate and bind memory for the image, we probably want to request device local memory too (VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) for optimal performance, since we won’t need to map the shadow map memory in the host for anything.
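
For reference, a minimal sketch of that allocation; find_memory_type() is a hypothetical helper (not a Vulkan entry point) that scans VkPhysicalDeviceMemoryProperties for a type matching the requested property flags:

...
VkMemoryRequirements mem_reqs;
vkGetImageMemoryRequirements(device, image, &mem_reqs);

VkMemoryAllocateInfo alloc_info = {};
alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.pNext = NULL;
alloc_info.allocationSize = mem_reqs.size;
// find_memory_type() is a hypothetical helper, not part of the Vulkan API.
alloc_info.memoryTypeIndex =
   find_memory_type(physical_device, mem_reqs.memoryTypeBits,
                    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);

VkDeviceMemory image_mem;
vkAllocateMemory(device, &alloc_info, NULL, &image_mem);
vkBindImageMemory(device, image, image_mem, 0);
...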

Since we are going to render to this image in the first pass of the process we also need to create a suitable image view that we can use to create a framebuffer. There are no special requirements here, we just create a view with the same format as the image and with a depth aspect:

...
VkImageViewCreateInfo view_info = {};
view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
view_info.pNext = NULL;
view_info.image = image;
view_info.format = VK_FORMAT_D32_SFLOAT;
view_info.components.r = VK_COMPONENT_SWIZZLE_R;
view_info.components.g = VK_COMPONENT_SWIZZLE_G;
view_info.components.b = VK_COMPONENT_SWIZZLE_B;
view_info.components.a = VK_COMPONENT_SWIZZLE_A;
view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT;
view_info.subresourceRange.baseMipLevel = 0;
view_info.subresourceRange.levelCount = 1;
view_info.subresourceRange.baseArrayLayer = 0;
view_info.subresourceRange.layerCount = 1;
view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
view_info.flags = 0;

VkImageView shadow_map_view;
vkCreateImageView(device, &view_info, NULL, &shadow_map_view);
...

Rendering the shadow map

In order to generate the shadow map image we need to render the scene from the point of view of the light, so first, we need to compute the corresponding View and Projection matrices. How we calculate these matrices depends on the type of light we are using. As described in the previous post, we can consider 3 types of lights: spotlights, positional lights and directional lights.

Spotlights are the easiest for shadow mapping, since with these we use regular perspective projection.

Positional lights work similarly to spotlights in the sense that they also use perspective projection; however, because these are omnidirectional, they see the entire scene around them. This means that we need to render a shadow map that contains scene objects in all directions around the light. We can do this by using a cube texture for the shadow map instead of a regular 2D texture and rendering the scene 6 times, adjusting the View matrix to capture scene objects in front of the light, behind it, to its left, to its right, above and below. In this case we want to use a field of view of 90º with the projection matrix so that the set of 6 images captures the full scene around the light source with no image overlaps.
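
As a sketch (assuming glm and a light_pos vector for the light’s position), the six per-face View matrices and the shared 90º projection could be set up like this:

// A sketch of the six cube-map face views for an omnidirectional light;
// the up vectors follow the usual cube-map face orientation.
glm::mat4 face_views[6] = {
   glm::lookAt(light_pos, light_pos + glm::vec3( 1, 0, 0), glm::vec3(0,-1, 0)), // +X
   glm::lookAt(light_pos, light_pos + glm::vec3(-1, 0, 0), glm::vec3(0,-1, 0)), // -X
   glm::lookAt(light_pos, light_pos + glm::vec3( 0, 1, 0), glm::vec3(0, 0, 1)), // +Y
   glm::lookAt(light_pos, light_pos + glm::vec3( 0,-1, 0), glm::vec3(0, 0,-1)), // -Y
   glm::lookAt(light_pos, light_pos + glm::vec3( 0, 0, 1), glm::vec3(0,-1, 0)), // +Z
   glm::lookAt(light_pos, light_pos + glm::vec3( 0, 0,-1), glm::vec3(0,-1, 0)), // -Z
};

// 90º FOV with a 1:1 aspect ratio so the six faces tile the sphere exactly.
glm::mat4 cube_projection =
   glm::perspective(glm::radians(90.0f), 1.0f, LIGHT_NEAR, LIGHT_FAR);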

Finally, we have directional lights. In the previous post I mentioned that these lights model light sources whose rays are parallel, and because of this they cast regular shadows (that is, shadows that are not perspective projected). Thus, to render shadow maps for directional lights we want to use orthographic projection instead of perspective projection.

Projected shadow from a point light source
Regular shadow from a directional light source

In this post I will focus on creating a shadow map for a spotlight source only. I might write follow up posts in the future covering other light sources, but for the time being, you can have a look at my OpenGL terrain renderer demo if you are interested in directional lights.

So, for a spotlight source, we just define a regular perspective projection, like this:

glm::mat4 clip = glm::mat4(1.0f, 0.0f, 0.0f, 0.0f,
                           0.0f,-1.0f, 0.0f, 0.0f,
                           0.0f, 0.0f, 0.5f, 0.0f,
                           0.0f, 0.0f, 0.5f, 1.0f);

glm::mat4 light_projection = clip *
      glm::perspective(glm::radians(45.0f),
                       (float) SHADOW_MAP_WIDTH / SHADOW_MAP_HEIGHT,
                       LIGHT_NEAR, LIGHT_FAR);

The code above generates a regular perspective projection with a field of view of 45º. We should adjust the light’s near and far planes to make them as tight as possible to reduce artifacts when we use the shadow map to render the shadows in the scene (I will go deeper into this in a later post). In order to do this we should consider that the near plane can be increased to reflect the closest that an object can be to the light (that might depend on the scene, of course) and the far plane can be decreased to match the light’s area of influence (determined by its attenuation factors, as explained in the previous post).
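
As an example, assuming the attenuation model from the previous post, att(d) = 1 / (constant + linear·d + quadratic·d²), we could derive LIGHT_FAR as the distance where attenuation drops below a small threshold. This is a hypothetical helper, and it assumes quadratic > 0:

#include <math.h>

// Distance at which attenuation falls below 'threshold', solving
// quadratic*d^2 + linear*d + (constant - 1/threshold) = 0 for d.
float
light_far_from_attenuation(float constant, float linear, float quadratic,
                           float threshold)
{
   float c = constant - 1.0f / threshold;
   return (-linear + sqrtf(linear * linear - 4.0f * quadratic * c)) /
          (2.0f * quadratic);
}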

The clip matrix is not specific to shadow mapping; it just makes the resulting projection consider the particularities of how the Vulkan coordinate system is defined (the Y axis is inverted, the Z range is halved).

As usual, the projection matrix provides us with a projection frustum, but we still need to point that frustum in the direction in which our spotlight is facing, so we also need to compute the view matrix transform of our spotlight. One way to define the direction in which our spotlight is facing is by having the rotation angles of the spotlight on each axis, similarly to what we would do to compute the view matrix of our camera:

glm::mat4
compute_view_matrix_for_rotation(glm::vec3 origin, glm::vec3 rot)
{
   glm::mat4 mat(1.0);
   float rx = DEG_TO_RAD(rot.x);
   float ry = DEG_TO_RAD(rot.y);
   float rz = DEG_TO_RAD(rot.z);
   mat = glm::rotate(mat, -rx, glm::vec3(1, 0, 0));
   mat = glm::rotate(mat, -ry, glm::vec3(0, 1, 0));
   mat = glm::rotate(mat, -rz, glm::vec3(0, 0, 1));
   mat = glm::translate(mat, -origin);
   return mat;
}

Here, origin is the position of the light source in world space, and rot represents the rotation angles of the light source on each axis, representing the direction in which the spotlight is facing.
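
Putting the two together (a sketch; light_pos and light_rot are assumed to describe our spotlight):

// Combine the light's view and projection into a single light-space
// transform; the shadow map vertex shader below consumes this as its
// ViewProjection uniform.
glm::mat4 light_view =
   compute_view_matrix_for_rotation(light_pos, light_rot);
glm::mat4 light_view_projection = light_projection * light_view;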

Now that we have the View and Projection matrices that define our light space we can go on and render the shadow map. For this we need to render scene as we normally would but instead of using our camera’s View and Projection matrices, we use the light’s. Let’s have a look at the shadow map rendering code:

Render pass

static VkRenderPass
create_shadow_map_render_pass(VkDevice device)
{
   VkAttachmentDescription attachments[1];

   // Depth attachment (shadow map)
   attachments[0].format = VK_FORMAT_D32_SFLOAT;
   attachments[0].samples = VK_SAMPLE_COUNT_1_BIT;
   attachments[0].loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
   attachments[0].storeOp = VK_ATTACHMENT_STORE_OP_STORE;
   attachments[0].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
   attachments[0].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
   attachments[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
   attachments[0].finalLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
   attachments[0].flags = 0;

   // Attachment references from subpasses
   VkAttachmentReference depth_ref;
   depth_ref.attachment = 0;
   depth_ref.layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

   // Subpass 0: shadow map rendering
   VkSubpassDescription subpass[1];
   subpass[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
   subpass[0].flags = 0;
   subpass[0].inputAttachmentCount = 0;
   subpass[0].pInputAttachments = NULL;
   subpass[0].colorAttachmentCount = 0;
   subpass[0].pColorAttachments = NULL;
   subpass[0].pResolveAttachments = NULL;
   subpass[0].pDepthStencilAttachment = &depth_ref;
   subpass[0].preserveAttachmentCount = 0;
   subpass[0].pPreserveAttachments = NULL;

   // Create render pass
   VkRenderPassCreateInfo rp_info;
   rp_info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
   rp_info.pNext = NULL;
   rp_info.attachmentCount = 1;
   rp_info.pAttachments = attachments;
   rp_info.subpassCount = 1;
   rp_info.pSubpasses = subpass;
   rp_info.dependencyCount = 0;
   rp_info.pDependencies = NULL;
   rp_info.flags = 0;

   VkRenderPass render_pass;
   VK_CHECK(vkCreateRenderPass(device, &rp_info, NULL, &render_pass));

   return render_pass;
}

The render pass is simple enough: we only have one attachment with the depth image and one subpass that renders to the shadow map target. We will start the render pass by clearing the shadow map and by the time we are done we want to store it and transition it to layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL so we can sample from it later when we render the scene with shadows. Notice that because we only care about depth information, the render pass doesn’t include any color attachments.

Framebuffer

Every rendering job needs a target framebuffer, so we need to create one for our shadow map. For this we will use the image view we created from the shadow map image. We link this framebuffer target to the shadow map render pass description we have just defined:

VkFramebufferCreateInfo fb_info;
fb_info.sType = VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO;
fb_info.pNext = NULL;
fb_info.renderPass = shadow_map_render_pass;
fb_info.attachmentCount = 1;
fb_info.pAttachments = &shadow_map_view;
fb_info.width = SHADOW_MAP_WIDTH;
fb_info.height = SHADOW_MAP_HEIGHT;
fb_info.layers = 1;
fb_info.flags = 0;

VkFramebuffer shadow_map_fb;
vkCreateFramebuffer(device, &fb_info, NULL, &shadow_map_fb);

Pipeline description

The pipeline we use to render the shadow map also has some particularities:

Because we only care about recording depth information, we can typically skip any vertex attributes other than the positions of the vertices in the scene:

...
VkVertexInputBindingDescription vi_binding[1];
VkVertexInputAttributeDescription vi_attribs[1];

// Vertex attribute binding 0, location 0: position
vi_binding[0].binding = 0;
vi_binding[0].inputRate = VK_VERTEX_INPUT_RATE_VERTEX;
vi_binding[0].stride = 2 * sizeof(glm::vec3);

vi_attribs[0].binding = 0;
vi_attribs[0].location = 0;
vi_attribs[0].format = VK_FORMAT_R32G32B32_SFLOAT;
vi_attribs[0].offset = 0;

VkPipelineVertexInputStateCreateInfo vi;
vi.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
vi.pNext = NULL;
vi.flags = 0;
vi.vertexBindingDescriptionCount = 1;
vi.pVertexBindingDescriptions = vi_binding;
vi.vertexAttributeDescriptionCount = 1;
vi.pVertexAttributeDescriptions = vi_attribs;
...
pipeline_info.pVertexInputState = &vi;
...

The code above defines a single vertex attribute for the position, but assumes that we read this from a vertex buffer that packs interleaved positions and normals for each vertex (each being a vec3) so we use the binding’s stride to jump over the normal values in the buffer. This is because in this particular example, we have a single vertex buffer that we reuse for both shadow map rendering and normal scene rendering (which requires vertex normals for lighting computations).
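
In other words, the buffer layout assumed by that stride is equivalent to this interleaved vertex struct (illustrative, not code from the original demo):

// Interleaved layout assumed above: position first, then normal.
struct Vertex {
   glm::vec3 pos;     // read at location 0 by the shadow map pipeline
   glm::vec3 normal;  // skipped via the binding stride during shadow rendering
};
// vi_binding[0].stride == sizeof(struct Vertex) == 2 * sizeof(glm::vec3)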

Again, because we do not produce color data, we can skip the fragment shader, and our vertex shader is a simple passthrough instead of the normal vertex shader we use with the scene:

...
VkPipelineShaderStageCreateInfo shader_stages[1];
shader_stages[0].sType =
   VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
shader_stages[0].pNext = NULL;
shader_stages[0].pSpecializationInfo = NULL;
shader_stages[0].flags = 0;
shader_stages[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
shader_stages[0].pName = "main";
shader_stages[0].module = create_shader_module("shadowmap.vert.spv", ...);
...
pipeline_info.pStages = shader_stages;
pipeline_info.stageCount = 1;
...

This is what the shadow map vertex shader (shadowmap.vert) looks like in GLSL:

#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(std140, set = 0, binding = 0) uniform vp_ubo {
    mat4 ViewProjection;
} VP;

layout(std140, set = 0, binding = 1) uniform m_ubo {
     mat4 Model[16];
} M;

layout(location = 0) in vec3 in_position;

void main()
{
   vec4 pos = vec4(in_position.x, in_position.y, in_position.z, 1.0);
   vec4 world_pos = M.Model[gl_InstanceIndex] * pos;
   gl_Position = VP.ViewProjection * world_pos;
}

The shader takes the ViewProjection matrix of the light (we have already multiplied both together in the host) and a UBO with the Model matrices of each object in the scene as external resources (we use instanced rendering in this particular example) as well as a single vec3 input attribute with the vertex position. The only job of the vertex shader is to compute the position of the vertex in the transformed space (the light space, since we are passing the ViewProjection matrix of the light), nothing else is done here.

Command buffer

The command buffer is pretty similar to the one we use with the scene, only that we render to the shadow map image instead of the usual render target. In the shadow map render pass description we have indicated that we will clear it, so we need to include a depth clear value. We also need to make sure that we set the viewport and scissor to match the shadow map dimensions:

...
VkClearValue clear_values[1];
clear_values[0].depthStencil.depth = 1.0f;
clear_values[0].depthStencil.stencil = 0;

VkRenderPassBeginInfo rp_begin;
rp_begin.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
rp_begin.pNext = NULL;
rp_begin.renderPass = shadow_map_render_pass;
rp_begin.framebuffer = shadow_map_framebuffer;
rp_begin.renderArea.offset.x = 0;
rp_begin.renderArea.offset.y = 0;
rp_begin.renderArea.extent.width = SHADOW_MAP_WIDTH;
rp_begin.renderArea.extent.height = SHADOW_MAP_HEIGHT;
rp_begin.clearValueCount = 1;
rp_begin.pClearValues = clear_values;

vkCmdBeginRenderPass(shadow_map_cmd_buf,
                     &rp_begin,
                     VK_SUBPASS_CONTENTS_INLINE);

VkViewport viewport;
viewport.height = SHADOW_MAP_HEIGHT;
viewport.width = SHADOW_MAP_WIDTH;
viewport.minDepth = 0.0f;
viewport.maxDepth = 1.0f;
viewport.x = 0;
viewport.y = 0;
vkCmdSetViewport(shadow_map_cmd_buf, 0, 1, &viewport);

VkRect2D scissor;
scissor.extent.width = SHADOW_MAP_WIDTH;
scissor.extent.height = SHADOW_MAP_HEIGHT;
scissor.offset.x = 0;
scissor.offset.y = 0;
vkCmdSetScissor(shadow_map_cmd_buf, 0, 1, &scissor);
...

Next, we bind the shadow map pipeline we created above, bind the vertex buffer and descriptor sets as usual and draw the scene geometry.

...
vkCmdBindPipeline(shadow_map_cmd_buf,
                  VK_PIPELINE_BIND_POINT_GRAPHICS,
                  shadow_map_pipeline);

const VkDeviceSize offsets[1] = { 0 };
vkCmdBindVertexBuffers(shadow_map_cmd_buf, 0, 1, &vertex_buf, offsets);

vkCmdBindDescriptorSets(shadow_map_cmd_buf,
                        VK_PIPELINE_BIND_POINT_GRAPHICS,
                        shadow_map_pipeline_layout,
                        0, 1,
                        &shadow_map_descriptor_set,
                        0, NULL);

vkCmdDraw(shadow_map_cmd_buf, ...);

vkCmdEndRenderPass(shadow_map_cmd_buf);
...

Notice that the shadow map pipeline layout will be different from the one used with the scene too. Specifically, during scene rendering we will at least need to bind the shadow map for sampling, and we will probably also bind additional resources to access light information, surface materials, etc. that we don’t need to render the shadow map, where we only need the View and Projection matrices of the light plus the UBO with the model matrices of the objects in the scene.
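
A sketch of that smaller layout, where shadow_map_set_layout is a hypothetical descriptor set layout holding just the two UBOs from the vertex shader above:

...
VkPipelineLayoutCreateInfo layout_info = {};
layout_info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
layout_info.pNext = NULL;
// Only set 0: the light's ViewProjection UBO plus the model-matrix UBO.
layout_info.setLayoutCount = 1;
layout_info.pSetLayouts = &shadow_map_set_layout;
layout_info.pushConstantRangeCount = 0;
layout_info.pPushConstantRanges = NULL;
layout_info.flags = 0;

VkPipelineLayout shadow_map_pipeline_layout;
vkCreatePipelineLayout(device, &layout_info, NULL,
                       &shadow_map_pipeline_layout);
...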

We are almost there, now we only need to submit the command buffer for execution to render the shadow map:

...
VkSubmitInfo submit_info = { };
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.pNext = NULL;
submit_info.waitSemaphoreCount = 0;
submit_info.pWaitSemaphores = NULL;
submit_info.pWaitDstStageMask = NULL;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &signal_sem;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &shadow_map_cmd_buf;

vkQueueSubmit(queue, 1, &submit_info, NULL);
...

Because the next pass of the algorithm will need to sample the shadow map during the final scene rendering, we use a semaphore to ensure that this work completes before we start using the shadow map in the next pass of the algorithm.
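
On the scene-rendering side, the wait could look roughly like this (a sketch; scene_cmd_buf is hypothetical, and the fragment shader stage is where the shadow map is first sampled):

...
// Make the scene pass wait on the shadow map's signal semaphore before
// its fragment shader stage runs.
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

VkSubmitInfo scene_submit = { };
scene_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
scene_submit.pNext = NULL;
scene_submit.waitSemaphoreCount = 1;
scene_submit.pWaitSemaphores = &signal_sem;
scene_submit.pWaitDstStageMask = &wait_stage;
scene_submit.commandBufferCount = 1;
scene_submit.pCommandBuffers = &scene_cmd_buf;

vkQueueSubmit(queue, 1, &scene_submit, NULL);
...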

In most scenarios, we will want to render the shadow map on every frame to account for dynamic objects that move in the area of effect of the light, or even moving lights. However, if we can ensure that no objects have altered their positions inside the area of effect of the light and that the light’s description (position/direction) hasn’t changed, we may not need to regenerate the shadow map and can save some precious rendering time.

Visualizing the shadow map

After executing the shadow map rendering job our shadow map image contains the depth information of the scene from the point of view of the light. Before we go on and start using this as input to produce shadows in our scene, we should probably try to visualize the shadow map to verify that it is correct. For this we just need to submit a follow-up job that takes the shadow map image as a texture input and renders it to a quad on the screen. There is one caveat though: when we use perspective projection, Z values in the depth buffer are not linear; instead, precision is larger at distances closer to the near plane and drops as we get closer to the far plane, in order to improve accuracy in areas closer to the observer and avoid Z-fighting artifacts. This means that we probably want to linearize our shadow map values when we sample from the texture so that we can actually see things, otherwise most things that are not close enough to the light source will be barely visible:

#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout(std140, set = 0, binding = 0) uniform mvp_ubo {
    mat4 mvp;
} MVP;

layout(location = 0) in vec2 in_pos;
layout(location = 1) in vec2 in_uv;

layout(location = 0) out vec2 out_uv;

void main()
{
   gl_Position = MVP.mvp * vec4(in_pos.x, in_pos.y, 0.0, 1.0);
   out_uv = in_uv;
}
#version 400

#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

layout (set = 1, binding = 0) uniform sampler2D image;

layout(location = 0) in vec2 in_uv;

layout(location = 0) out vec4 out_color;

void main()
{
   float depth = texture(image, in_uv).r;
   out_color = vec4(1.0 - (1.0 - depth) * 100.0);
}

We can use the vertex and fragment shaders above to render the contents of the shadow map image on to a quad. The vertex shader takes the quad’s vertex positions and texture coordinates as attributes and passes them to the fragment shader, while the fragment shader samples the shadow map at the provided texture coordinates and then “linearizes” the depth value so that we can see better. The code in the shader doesn’t properly linearize the depth values we read from the shadow map (that requires passing the Z-near and Z-far values used in the projection), but for debugging purposes this works well enough for me. If you use different Z clipping planes you may need to alter the ‘100.0’ value to get good results (or you might as well do a proper conversion considering your actual Z-near and Z-far values).
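
For reference, a proper linearization for a [0, 1] depth range would look something like this (a sketch; the same expression works in the GLSL fragment shader, given the Z-near and Z-far values of the light’s projection):

// Convert a [0,1] perspective depth value back to an eye-space distance,
// given the near/far planes used in the light's projection.
float
linearize_depth(float depth, float z_near, float z_far)
{
   return (z_near * z_far) / (z_far - depth * (z_far - z_near));
}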

Visualizing the shadow map

The image shows the shadow map on top of the scene. Darker colors represent smaller depth values, so these are fragments closer to the light source. Notice that we are not rendering the floor geometry to the shadow map since it can’t cast shadows on any other objects in the scene.

Conclusions

In this post we have described the shadow mapping technique as a combination of two passes: the first pass renders a depth image (the shadow map) with the scene geometry from the point of view of the light source. To achieve this, we need a passthrough vertex shader that only transforms the scene vertex positions (using the view and projection transforms from the light) and we can skip the fragment shader completely since we do not care for color output. The second pass, which we will cover in the next post, takes the shadow map as input and uses it to render shadows in the final scene.

July 29, 2017

Alex Larsson talks about Flathub at GUADEC 2017

At the Gtk+ hackfest in London earlier this year, we stole an afternoon from the toolkit folks (sorry!) to talk about Flatpak, and how we could establish a “critical mass” behind the Flatpak format. Bringing Linux container and sandboxing technology together with ostree, we’ve got a technology which solves real world distribution, technical and security problems which have arguably held back the Linux desktop space and frustrated ISVs and app developers for nearly 20 years. The problem we need to solve, like any ecosystem, is one of users and developers – without stuff you can easily get in Flatpak format, there won’t be many users, and without many users, we won’t have a strong or compelling incentive for developers to take their precious time to understand a new format and a new technology.

As Alex Larsson said in his GUADEC talk yesterday: Decentralisation is good. Flatpak is a tool that is totally agnostic of who is publishing the software and where it comes from. For software freedom, that’s an important thing because we want technology to empower users, rather than tell them what they can or can’t do. Unfortunately, decentralisation makes for a terrible user experience. At present, the Flatpak webpage has a manually curated list of links to 10s of places where you can find different Flatpaks and add them to your system. You can’t easily search and browse to find apps to try out – so it’s clear that if the current situation remains we’re not going to be able to get a critical mass of users and developers around Flatpak.

Enter Flathub. The idea is that by creating an obvious “center of gravity” for the Flatpak community to contribute and build their apps, users will have one place to go and find the best that the Linux app ecosystem has to offer. We can take care of the boring stuff like running a build service and empower Linux application developers to choose how and when their app gets out to their users. After the London hackfest we sketched out a minimum viable system – Github, Buildbot and a few workers – and got it going over the past few months, culminating in a mini-fundraiser to pay for the hosting of a production-ready setup. Thanks to the 20 individuals who supported our fundraiser, to Mythic Beasts who provided a server along with management, monitoring and heaps of bandwidth, and to Codethink and Scaleway who provide our ARM and Intel workers respectively.

We inherit our core principles from the Flatpak project – we want the Flatpak technology to succeed at alleviating the issues faced by app developers in targeting a diverse set of Linux platforms. None of this stops you from building and hosting your own Flatpak repos and we look forward to this being a wide and open playing field. We care about the success of the Linux desktop as a platform, so we are open to proprietary applications through Flatpak’s “extra data” feature where the client machine downloads 3rd party binaries. They are correctly labeled as such in the AppStream, so will only be shown if you or your OS has configured GNOME Software to show you apps with proprietary licenses, respecting the user’s preference.

The new infrastructure is up and running and I put it into production on Thursday. We rebuilt the whole repository on the new system over the course of the week, signing everything with our new 4096-bit key stored on a Yubikey smartcard USB key. We have 66 apps at the moment, although Alex is working on bringing in the GNOME apps at present – we hope those will be joined soon by the KDE apps, and Endless is planning to move over as many of our 3rd party Flatpaks as possible over the coming months.

So, thanks again to Alex and the whole Flatpak community, and the individuals and the companies who supported making this a reality. You can add the repository and get downloading right away. Welcome to Flathub! Go forth and flatten… 🙂

Flathub logo

Like all good blog posts, this one starts with an apology about not blogging for ages – in my case it looks like it’s been about 7 years which is definitely a new personal best (perhaps the equally or more remarkable thing is that I have diligently kept WordPress running in the meantime). In that time, as you might expect, a few things have happened, like I met a wonderful woman and fell in love and we have two wonderful children. I also decided to part ways with my “first baby” and leave my role as CTO & co-founder of Collabora. This was obviously a very tough decision – it’s a fantastic team where I met and made many life-long friends, and they are still going strong and doing awesome things with Open Source. However, shortly after that, in February last year, I was lucky enough to land what is basically a dream job working at Endless Computers as the VP of Deployment.

As I’m sure most readers know, Endless is building an OS to bring personal computing to millions of new users across the world. It’s based on Free Software like GNOME, Debian, ostree and Flatpak, and the more successful Endless is, the more people who get access to education, technology and opportunity – and the more FOSS users and developers there will be in the world. But in my role, I get to help define the product, understand our users and how technology might help them, take Open Source out to new people, solve commercial problems to get Endless OS out into the world, manage a fantastic team and work with bright people, learn from great managers and mentors, and still find time to squash bugs, review and write more code than I used to. Like any startup, we have a lot to do and not enough time to do it, so although there aren’t quite enough days in the week, I’m really happy!

In any case, the main point of this blog post is that I’m at GUADEC in Manchester right now, and I’d like to blog about Flathub, but I thought it would be weird to just show up and say that after 7 years of silence without saying hello again. 🙂

July 28, 2017

What is the worst thing that can happen to someone’s work? Probably showing it to someone and them not being impressed by it at all. Guess what happened to me! I said in the last post that I wanted to reach out to the good people from the xorg-devel mailing list for some feedback on my work so far. I did, but the response wasn’t that reassuring.

Basically, several fundamental parts of my code were questioned. OK, the problem of timings with the frame callback was easy to solve - just don’t do it for now, it’s not really necessary. But the problems pointed out by Michel had a more profound impact. How I should deal with the Present extension and what Michel requested couldn’t easily be integrated into the code I had written.

But I didn’t lose my motivation because of this setback, and of course I’m deeply thankful for the critique voiced by Pekka and Michel, since I can only learn from their feedback, and it’s for sure better to rework the code now than later. I did exactly that in the last few days. And in my opinion with great success.

I interpreted Michel’s mail such that I’m allowed to make larger changes to the Present extension code. Otherwise the rootless flipping wouldn’t be possible for sure. So I added a “Rootless mode” to the extension, which currently is only used by the Xwayland DDX, but could be used in the future by other DDXes with similar functionality. It allows flipping windows individually and without the need to be full screen. This is a huge functionality boost compared to the normal X behavior, where a window needs to be full screen on the whole Screen, which in particular means that flipping won’t work with a full screen window on a single display in a RandR environment.

Of course some big change like this never works right away. For example, I forgot to increase the reference counter on the pixmap remembered for restoring the window, and only after a tedious bug hunt found that this was the culprit for sporadic segfaults when exiting the full screen mode of the VLC player.

And yes, in order to be able to flip the pixmap not only in full screen mode but even inside a window with other sub-windows being composited around it, more was necessary. The solution was in the end to use a sub-surface for the flipping pixmap. And I’m not sure if it will be viable in the end, but I’m damn proud that it works so far.

Sometimes I still experience visual artifacts on window movements. The compositor doesn’t seem to get informed that the sub-surface has been moved away and still paints its content where it was before. But maybe the culprit in this case is KWin not handling the sub-surface correctly. I’ll need to cross-check this problem with some other compositor like Weston.

And I plan on writing another mail this weekend to the mailing list. Let’s see how this radical new beginning will resonate.

I can start this blog post by sharing the news that I passed my second evaluation! This means that I’m now on the last sprint towards the finish line, with profile support and the welcome screen ahead of me. Yes; you’ve read that correctly: I didn’t mention the button page. I managed to fix that this week and it’s now pending review and some final issues (more about that below).
July 27, 2017

A few days ago, I pushed code for button debouncing into libinput, scheduled for libinput 1.9. What is button debouncing you ask? Well, I'm glad you asked, because otherwise typing this blog post would've been a waste of time :)

Over in Utopia, when you press the button on a device, you get a press event from the hardware. When you release said button, you get a release event from the hardware. Together, they form the button click interaction we have come to learn and love over the last couple of decades. Life is generally merry and the sunshine to rainbow to lollipop ratio is good. Meanwhile, over here in the real world, buttons can be quite dodgy, don't always work like they're supposed to, lollipops are unhealthy and boy, have you seen that sunburn the sunshine gave me? One way buttons may not work is that they can lose contact for a fraction of a second and send release events even though the button is being held down. The device usually detects that the button is still held down in the next hardware cycle (~8ms on most devices) and thus sends another button press.

For us, there are not a lot of hints that this is bad hardware besides the timestamps. It's not possible for a user to really release and press a button within 8ms, so we can take this as a signal for dodgy hardware. But at least that's something. All we need to do is ignore the release event (and the subsequent button press) and only release when the button is released properly. This requires timeouts and delays of the event, something we generally want to avoid unless absolutely necessary. So the behaviour libinput has now is enabled but inactive button debouncing on all devices: we monitor button release and button press timestamps, but otherwise leave the events as-is, so no delays are introduced. Only if a device sends release/press combos with unfeasibly short timeouts do we activate button debouncing. Once active, we filter all button release events and instead set a timer. Once the timer expires, we send the button release event. But if at any time before then another button press is detected, the scheduled release is discarded, the button press is filtered and no event is sent. Thus, we paper over the release/press combo the hardware gives us, and to anyone using libinput it will look like the button was held down the whole time.
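
A minimal sketch of that state machine (hypothetical names and timeout value; libinput's real implementation is more involved):

#include <stdint.h>
#include <stdbool.h>

/* A toy model of the active debouncing logic described above; not
 * libinput's actual code. Timestamps are in milliseconds and the
 * timeout is made up for illustration. */
enum debounce_state { DEBOUNCE_IDLE, DEBOUNCE_RELEASE_PENDING };

struct debouncer {
    enum debounce_state state;
    uint64_t release_time;  /* when the held-back release arrived */
};

#define DEBOUNCE_TIMEOUT_MS 25  /* hypothetical */

/* Returns true if the event should be passed on, false if filtered.
 * A timer (not shown) emits the held-back release once the timeout
 * expires without a new press arriving. */
static bool
debounce_button(struct debouncer *d, bool is_press, uint64_t now)
{
    if (!is_press) {
        /* Hold back the release and start the timer instead. */
        d->state = DEBOUNCE_RELEASE_PENDING;
        d->release_time = now;
        return false;
    }
    if (d->state == DEBOUNCE_RELEASE_PENDING &&
        now - d->release_time < DEBOUNCE_TIMEOUT_MS) {
        /* Bounce: discard the scheduled release, filter this press too. */
        d->state = DEBOUNCE_IDLE;
        return false;
    }
    d->state = DEBOUNCE_IDLE;
    return true;
}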

There's one downside with this approach - the very first button debounce to happen on a device will still trigger an erroneous button release event. It remains to be seen whether this is a problem in real-world scenarios. That's the cost of having it as an auto-enabling feature rather than an explicit configuration option.

If you do have a mouse that suffers from button bouncing, I recommend you try libinput's master branch and file any issues if the debouncing doesn't work as it should. Might as well get any issues fixed before we have a release.

July 26, 2017
A quick update, as we've touched upon Evince recently.

I mentioned that we switched from using external tools for decompression to using libarchive. That's not the whole truth: we switched to using libarchive for CBZ, CB7 and the infamous CBT, but used a copy/paste version of unarr to support RAR files, as libarchive's RAR support lacks some needed features.

We hope to eventually remove the internal copy of unarr, but, as a stop-gap, that allowed us to start supporting CBR comics out of the box, and it's always a good thing when you have one less non-free package to grab from somewhere to access your media.

The second new format is really two formats, from either side of the 2-digit-year divide: PostScript-based Adobe Illustrator and PDF-based Adobe Illustrator. Evince now declares support for "the format" if both of the backends are built and supported. It only took 12 years, and somebody stumbling upon the feature request while doing bug triaging. The nooks and crannies of free software where the easy feature requests get lost :)


Both features will appear in GNOME 3.26, the out-of-the-box CBR support is however available now in an update for the just released Fedora 26.

I know many of you have wanted to test running Wayland on NVidia. The work on this continues between Jonas Ådahl, Adam Jackson and various developers at NVidia. It is not ready for primetime yet as we are still working on the server side glvnd piece we need for XWayland. That said with both Adam Jackson looking at this from our side and Kyle Brenneman looking at it from NVidia I am sure we will be able to hash out the remaining open questions and get that done.

In the meantime Miguel A. Vico from NVidia has set up a Copr to let people start testing using EGLStreams under Wayland. I haven’t tested it myself yet, but if you do and have trouble make sure to let Miguel and Jonas know.

As a sidenote, I am heading off to GUADEC in Manchester tomorrow and we do plan to discuss efforts like these there. We have team members like Jonas Ådahl flying in from Taiwan and Peter Hutterer flying in from Australia, so it will be a great chance to meet core developers who are far away from us in terms of timezone and geographical distance. GUADEC this year should be a lot of fun, and from what I hear we are going to have record attendance based on early registration numbers, so if you can make it to Manchester I strongly recommend joining us, as I think this year’s event will have a lot of energy and a lot of interesting discussions on what the next steps are for GNOME.

July 24, 2017

This week’s VC5 progress included:

  • Major fixes to texture tiling and layout
  • Major fixes to branching in shaders (the uniform stream didn’t branch the same as the PC)
  • Implementing textureSize()
  • Implementing texelFetch()
  • Implementing textureQueryLevels()
  • Partially implementing transform feedback
  • Fixing gl_VertexID/gl_InstanceID usage.
  • Fixing some NaN behavior
  • Fixing assertion failures with PointSize set when drawing non-points.
  • Fixing use-after-free with Z or stencil (but not both) clears.

VC5’s GLES3 correctness is reaching the point where I need to start building the 7268 kernel driver, and then do some of the basic performance work (shader threading, QPU instruction scheduling).

I made another small fix to the X11 copy performance series I have, that improves things when windows have rounded corners, and spent a while planning out next week’s work on the overlapping blit extension.

I sent out another version of the DSI series (effectively v6). The bridge driver maintainers seem happy with my core DSI support changes, so hopefully the panel maintainer will be OK with them as well.

July 21, 2017

I planned on writing about the Present extension this week, but I’ll postpone this since I’m currently strongly absorbed into finding the last rough edges of a first patch I can show off. I then hope to get some feedback on this from other developers in the xorg-devel mailing list.

Another reason is that I have stalled my work on the Present extension for now to first get my Xwayland code working. My mentor Daniel recommended this, since the approach I pursued in my work on Present might be more difficult than I first assessed. At least it is similar to something that other, way more experienced developers tried in the past and, according to Daniel, weren’t able to pull off. My idea was to make Present flip per CRTC only, but this would clash with Pixmaps being linked to the whole screen only. There are no per-CRTC Pixmaps in X.

On the other hand, when accepting the restriction of only being able to flip one window at a time, my code already works quite well. The flipping is smooth and, at least in a short test, also improved the frame rate. But the main problem I had, and still to some degree have, is that stopping the flipping can fail. The reason seems to be that the Present extension always sets the Screen Pixmap on flips. But when I test my work with KWin, it drives Xwayland in rootless mode, i.e. without a Screen Pixmap and only the Window Pixmaps. I’m currently looking into how to circumvent this in Xwayland. I think it’s possible, but I need to look very carefully at how to change the process in order not to forget necessary cleanups on the flipped Pixmaps. I hope though that I’m able to solve these issues this weekend and then get some feedback on the xorg-devel mailing list.

As always you can find my latest work on my working branch on GitHub.

@GodTributes took over my title, soz.

Dude, where's my maintainer?

Last year, probably as a distraction from doing anything else, or maybe because I was asked, I started reviewing bugs filed as a result of automated flaw discovery tools (from Coverity to UBSan via fuzzers) being run on gdk-pixbuf.

Apart from the security implications of a good number of those problems, there was also the annoyance of having a busted image file bring down your file manager, your desktop, or even an app that opened a file chooser either because it was broken, or because the image loader for that format didn't check for the sanity of memory allocations.

(I could have added links to Bugzilla entries for each one of the problems above, but that would just make it harder to read)

Two big things happened in gdk-pixbuf 2.36.1, which was used in GNOME 3.24:

  • the removal of GdkPixdata as a stand-alone image format loader. We really don't want to load GdkPixdata files from sources other than generated sources or embedded data structures, and removing that loader closed off those avenues. We still ended up fixing a fair number of naive assumptions in helper functions though.
  • the addition of a thumbnailer for gdk-pixbuf supported images. Images would not be special-cased any more in gnome-desktop's thumbnailing code, making the file manager, the file chooser and anything else navigating directories full of broken and huge images more reliable.
But that's just the start. gdk-pixbuf continues getting bug fixes, and we carry on checking for overflows, underflows and just flows, breaks and beats in general.

Programmatic Thumbellina portrait-maker

Picture, if you will, a website making you download garbage files from the Internet, the ROM dump of a NES cartridge that wasn't properly blown on and digital comic books that you definitely definitely paid for.

That's a nice summary of the security bugs foisted upon GNOME in the past year or so, even if, thankfully, we were ahead of the curve in terms of fixing those issues (the GStreamer NSF decoder bug was removed in 2013, the comics backend in evince was rewritten over a period of 2 years and committed in March 2017).

Still, 2 pieces of code were running on pretty much every file downloaded, on purpose or not, from the Internet: Tracker's indexers and the file manager's thumbnailers.

Tracker started protecting itself not long after the NSF vulnerability, even if recent versions of GStreamer weren't vulnerable, as we mentioned.

That left the thumbnailers. Some of those are first party, like the gdk-pixbuf, and those offered by core applications (Evince, Videos), written by GNOME developers (yours truly for both epub/mobi and Nintendo DS).

They're all good quality code I'd vouch for (having written or maintained quite a few of them), but they can rely on third-party libraries (say GStreamer, poppler, or libarchive), have naive or insufficiently defensive code (gdk-pixbuf loaders,  GStreamer plugins) or, worst of all: THIRD-PARTY EXTENSIONS.

There are external plugins and extensions for image formats in gdk-pixbuf, for video and audio formats in GStreamer, and for thumbnailers pretty much anywhere. We can't control those, but the least we can do when they explode in a wet mess is make sure that the toilet door is closed.

Not even Nicholas Cage can handle this Alcatraz

For GNOME 3.26 (and today in git master), the thumbnailer stall will be doubly bolted by a Bubblewrap sandbox and a seccomp blacklist.

This closes a whole vector of attack for the GNOME Desktop, but doesn't mean we're completely out of the woods. We'll need to carry on maintaining and fixing security bugs in those libraries and tools we depend on, as GStreamer plugin bugs still affect Videos, gdk-pixbuf bugs still affect Photos and Eye Of Gnome, etc.

And there are limits to what those 2 changes can achieve. The sandboxing and syscall blacklisting avoids those thumbnailers writing anything but an image file in PNG format in a specific directory. There's no network, the filename of the original file is hidden and sanitised, but the thumbnailer could still create a crafted PNG file, and the sandbox doesn't work inside a sandbox! So no protection if the application running the thumbnailer is inside Flatpak.

In fine

GNOME 3.26 will have better security for thumbnailers, so you won't "need to delete GNOME Files".

But you'll probably want to be careful with desktops that forked our thumbnailing code, namely Cinnamon and MATE, which don't implement those security features.

The next step for the thumbnailers will be beefing up our protection against greedy thumbnailers (in terms of CPU and memory usage), and sharing the code better between thumbnailers.

Note for later, more images of cute animals.
When I discussed this project with my mentor before GSoC, he told me that the button mappings were going to be the most complicated piece. This week I’ve been working on precisely that and, well, let’s just say he wasn’t wrong 😉 If you’ve been following along on GitHub, you’re probably thinking that it was a slow week. Indeed, there hasn’t been that much activity this week as in previous weeks.
July 20, 2017

Since the last post a lot of work has gone into upstreaming and stabilizing the etnaviv-on-Android ecosystem. This has involved Android, kernel and Mesa changes, many of which are available upstream now. A How-To for getting up and running on an i.MX6 dev board is available here.

Improvements

Modifiers support

Modifiers support has been accepted into Mesa, GBM and gbm_gralloc. Modifiers were mentioned in a previous post.

Etnaviv driver support for Android

Patches enabling the etnaviv Mesa driver being built for Android have now landed upstream.

Stability on Android

A number of small stability issues seen when running Android on i.MX6 hardware have been fixed, and the platform is now relatively stable.

Performance diagnostics

We have a decent understanding that the …

July 17, 2017

Video of my casync Presentation @ kinvolk

The great folks at kinvolk have uploaded a video of my casync presentation at their offices last week.

The slides are available as well.

Enjoy!

This week’s VC5 progress primarily involved rewriting the CLIF-style debug dumps using Intel’s gen_decode.c code. Instead of code-generating a bunch of C functions that print out a struct’s contents, I now have a little bit of C code that parses a compressed version of the XML at runtime to pick apart the struct and dump it. I’ve implemented this on VC4 and VC5, and started the Android build debugging process for it.
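
The gist of the approach, very roughly sketched below with expat (this is not the actual Mesa code, and the real decoder also has to inflate the compressed XML first, which is omitted here; the "field" element and "name" attribute reflect the genxml layout as I understand it):

```c
/* Rough sketch: walk the packet description XML at runtime and report
 * each field, instead of code-generating one dump function per struct. */
#include <stdio.h>
#include <string.h>
#include <expat.h>

static void XMLCALL start_element(void *data, const char *name,
                                  const char **attrs)
{
    /* In a real decoder, field elements carry the bit offsets and widths
     * used to slice values out of the packed command-stream struct. */
    if (strcmp(name, "field") == 0) {
        for (int i = 0; attrs[i] != NULL; i += 2) {
            if (strcmp(attrs[i], "name") == 0)
                printf("  field: %s\n", attrs[i + 1]);
        }
    }
}

void dump_field_names(const char *xml, int len)
{
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetStartElementHandler(parser, start_element);
    XML_Parse(parser, xml, len, 1 /* final chunk */);
    XML_ParserFree(parser);
}
```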

I also finished fixing the regressions from my VC5 QIR redesign. We now operate on just QPU instructions with sideband information for register allocation, instead of a higher-level IR (that’s what NIR is for).

For Raspbian performance, I’ve been talking with keithp and others about my window-dragging performance issue. My current plan is to implement a little GL extension that gives glBlitFramebuffer() defined behavior for 1:1 overlapping copies, and then use that from X11 to avoid the temporary. That should cut the cost of the window movement in half (not counting the cost of the drawing caused by the expose events).
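
For illustration, the self-copy such an extension would permit looks something like this (a sketch with hypothetical parameters, not the extension itself; today the GL spec leaves overlapping blits undefined, and the proposal would give this exact 1:1 case defined results):

```c
/* Blit a window-sized rectangle within the same bound framebuffer,
 * offset by the window-move delta (dx, dy). */
#include <GLES3/gl3.h>

void move_window_contents(int x, int y, int w, int h, int dx, int dy)
{
    /* Assumes the same framebuffer is bound for reading and drawing. */
    glBlitFramebuffer(x, y, x + w, y + h,             /* source rectangle */
                      x + dx, y + dy,                 /* destination...   */
                      x + dx + w, y + dy + h,
                      GL_COLOR_BUFFER_BIT,
                      GL_NEAREST);                    /* 1:1, no filtering */
}
```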

On the KMS front, I’ve fixed a regression in dual-display support from my previous tiling work. In the process, I’ve also written a fix for panning, which was broken even before the tiling work. I’ve pushed a fix from Boris for a warning on CRTC enable. I’ve also worked on handling the review feedback from the last DSI series, and started on review of Hans’s VC4 CEC support.

July 14, 2017

I provided in the past few weeks some general information about my project and hopefully helpful documentation for the multiple components I’m working with, but I have not yet talked about the work I’m doing on the code itself. Let’s change this today.

You can find my work branch on GitHub. It’s basically just a personal repository so I can sync my work between my devices, so be warned: the commits are messy, nothing is cleaned up, and debug lines as well as temporary TODOs are all over the place. And to be honest, up until yesterday my changes didn’t amount to much. For some reason no picture was displayed in my two test applications, Neverball and VLC.

Then on the weekend the KWin Wayland session suddenly wouldn’t even launch anymore. Well, at least that issue I was able to fix pretty quickly. But there was still no picture; it seemed the presenting just halted after the first buffer was sent to KWin, without any further messages. Neither the Xwayland server nor the client was unresponsive though. Only today could I finally solve the problem, thanks to Daniel’s help. The reason for the halt was that I was waiting on a frame callback from KWin in order to present the next frame. But it never arrived, since I hadn’t set any damage in the previous frame and KWin therefore wouldn’t signal a new frame. I fixed it by adding a generic damage request for now. After that the picture displayed and moved nicely.
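
In plain wayland-client terms, the fix amounts to something like the following (a simplified client-side sketch, not the actual Xwayland code; the surface, buffer and dimensions are assumed to come from elsewhere):

```c
/* Damage must be posted before the commit, otherwise the compositor
 * may never fire the frame callback we wait on. */
#include <stddef.h>
#include <wayland-client.h>

static void frame_done(void *data, struct wl_callback *cb, uint32_t time);
static const struct wl_callback_listener frame_listener = { frame_done };

void present_frame(struct wl_surface *surface, struct wl_buffer *buffer,
                   int width, int height)
{
    wl_surface_attach(surface, buffer, 0, 0);

    /* Without this, no new content is announced and no frame event comes. */
    wl_surface_damage(surface, 0, 0, width, height);

    struct wl_callback *cb = wl_surface_frame(surface);
    wl_callback_add_listener(cb, &frame_listener, NULL);
    wl_surface_commit(surface);
}

static void frame_done(void *data, struct wl_callback *cb, uint32_t time)
{
    wl_callback_destroy(cb);
    /* ...attach, damage and commit the next buffer here... */
}
```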

This is definitely the first big milestone of this project. Until now all I had achieved was increasing my own knowledge by reading documentation and poking into the code with debug lines. Ok, I also added some code I hoped would make sense, but beyond it compiling there was no feedback from a working prototype to tell me whether my code was going in the right direction or was utter bollocks. But after today I can say that my buffer flipping and committing code at least produces a picture. And looking at the FPS counter in Neverball, I would even say that the buffer flipping replacing all the buffer copies has already improved the frame rate.

But to test this I first had to solve another problem: the frame rate was always limited to the 60 Hz of my display. The reason was simple: I called present_event_notify only on the frame callback, but in the Xwayland case we can call it directly after the buffer has been sent to the compositor. The only problem I see with this is that the Present extension assumes that after a new Pixmap has been flipped, the old one can instantly be marked ready to be used again for new rendering content. But if the last Pixmap’s buffer is still used by the compositor in some way, this can lead to tearing.
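
As an aside, Wayland itself has a signal for exactly this situation: the wl_buffer release event, which tells the client when the compositor is done with a buffer. One conceivable way to drive the Pixmap-reuse bookkeeping off it (mark_pixmap_idle is a hypothetical helper, not a real API):

```c
/* Sketch: mark a Pixmap reusable only once the compositor releases
 * the wl_buffer backing it. */
#include <wayland-client.h>

void mark_pixmap_idle(void *pixmap); /* hypothetical bookkeeping helper */

static void buffer_released(void *data, struct wl_buffer *buffer)
{
    /* Only now is it safe to hand the Pixmap back for new rendering. */
    mark_pixmap_idle(data);
}

static const struct wl_buffer_listener buffer_listener = {
    .release = buffer_released,
};

/* At buffer-creation time, with the Pixmap as user data:
 *     wl_buffer_add_listener(buffer, &buffer_listener, pixmap);
 */
```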

This hints at a fundamental issue with our approach of using the Present extension in Xwayland. The extension was written with hardware in mind. It assumes a flip happens directly on a screen; there is no intermediate link like a Wayland compositor, and once a flip has happened the old buffer is no longer on the screen.

Why do we still try to leverage the Present extension support in Xwayland then? There are two important features of a Wayland compositor we want to keep with Xwayland: a tear-free experience for the user, and the ability to output a buffer rendered by a direct rendering client on a hardware plane without any copies in between. “Every frame is perfect” should remain valid even when using some legacy application, and avoiding unnecessary copies is simply a question of performance. This is especially important for many of the more demanding games out there, which won’t be Wayland native in the short term, and some of them maybe never.

Both features need full Present extension support in the Xwayland DDX. Without it a direct rendering application would still use the Present extension, but only with its fallback code path of copying the Pixmap’s content. And for a tear-free experience we would at least need to sync these copies to the frame events sent by the Wayland compositor, or better, directly allow multiple buffers, otherwise we would limit our frame rate. In both cases this again means increasing the Present extension support.

I plan on writing about the Present extension in detail next week. So if you didn’t fully understand some of the concepts I talked about in this post, it could be a good idea to check back.