April 23, 2018

As part of preparing my last two talks at LCA on the kernel community, “Burning Down the Castle” and “Maintainers Don’t Scale”, I have looked into how the kernel’s maintainer structure can be measured. One very interesting approach is looking at the pull request flows, as done for example in the LWN article “How 4.4’s patches got to the mainline”. Note that in the Linux kernel process, pull requests are only used to submit development from entire subsystems, not individual contributions. What I’m trying to work out here isn’t so much the overall patch flow, but how maintainers work, and how that differs between subsystems.


In my presentations I claimed that the kernel community is suffering from hierarchies that are too steep. Worse, the people in power don’t bother to apply the same rules to themselves as to everyone else, especially around purported quality enforcement tools like code review.

For our purposes a contributor is someone who submits a patch to a mailing list, but needs a maintainer to apply it for them to get the patch merged. A maintainer, on the other hand, can directly apply a patch to a subsystem tree, and will then send pull requests up the maintainer hierarchy until the patch lands in Linus’ tree. This is relatively easy to measure accurately in git: if the recorded patch author and committer match, it’s a maintainer self-commit; if they don’t match, it’s a contributor commit.
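To illustrate, here is a minimal sketch (in Python; not the actual analysis scripts used for this article) of that classification, assuming a local clone of the repository:

    import subprocess

    def self_commit_stats(repo):
        # %ae is the author email, %ce the committer email.
        log = subprocess.run(
            ["git", "-C", repo, "log", "--pretty=format:%ae|%ce"],
            capture_output=True, text=True, check=True).stdout
        total = self_commits = 0
        for line in log.splitlines():
            author, committer = line.split("|", 1)
            total += 1
            if author == committer:
                # The maintainer applied their own patch: a self-commit.
                self_commits += 1
        return self_commits, total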

There are a few annoying special cases to handle:

  • Some people use different email addresses or spellings, and sometimes MTAs, patchwork and other tools used in the patch flow chain mangle things further. This could be fixed up with the mail mapping database that LWN for example uses to generate its contributor statistics. Since most maintainers have reasonable setups it doesn’t seem to matter much, hence I decided not to bother.

  • There are subsystems not maintained in git, but in the quilt patch management system. Andrew Morton’s tree is the only one I’m aware of, and I hacked up my scripts to handle this case. After that I realized it doesn’t matter, since Andrew merged exceedingly few of his own patches himself, most have been fixups that landed through other trees.

Also note that this is a property of each commit - the same person can be both a maintainer and a contributor, depending upon how each of their patches gets merged.

The ratio of maintainer self-commits to overall commits then gives us a crude but fairly useful metric for how steeply the kernel community is organized overall.

Measuring review is much harder. For contributor commits review is not recorded consistently. Many maintainers forgo adding an explicit Reviewed-by tag since they’re adding their own Signed-off-by tag anyway. And since that’s required for all contributor commits, it’s impossible to tell whether a patch has seen formal review before merging. A reasonable assumption though is that maintainers actually look at stuff before applying it. For a minimal definition of review - “a second person looked at the patch before merging and deemed the patch a good idea” - we can assume that merged contributor patches have a review ratio of 100%. Whether that was a full formal review or not unfortunately cannot be measured with the available data.

Maintainer self-commits are a different story: if there is no tag indicating review by someone else, then either it didn’t happen, or the maintainer felt the work wasn’t important enough to justify the minimal effort of recording it. Either way, a patch where the git author and committer match, and which sports no review tags in the commit message, strongly suggests it has indeed seen no review.

An objection would be that these patches get reviewed by the next maintainer up, when the pull request gets merged. But there’s well over a thousand such patches each kernel release, and most of the pull requests containing them go directly to Linus in the 2 week long merge window, when the over 10k feature patches of each kernel release land in the mainline branch. It is unrealistic to assume that Linus carefully reviews hundreds of patches himself in just those 2 weeks, while getting hammered by pull requests all around. Similar considerations apply at a subsystem level.

For counting reviews I looked at anything that indicates some kind of patch review, even very informal ones, to stay consistent with the implied oversight the maintainer’s Signed-off-by line provides for merged contributor patches. I therefore included both Reviewed-by and Acked-by tags, including a plethora of misspelled and combined versions of the same.
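As an illustration of that tag counting (again a sketch, not the actual scripts), a deliberately permissive regular expression can catch the common spellings along with many misspelled and combined variants:

    import re

    # Matches "Reviewed-by:", "Acked-by:" and combined or misspelled
    # variants like "Acked-and-reviewed-by:"; anything containing
    # "review" or "ack" and ending in "-by:" counts as some review.
    REVIEW_TAG = re.compile(r"^\s*\S*(review|ack)\S*-by:",
                            re.IGNORECASE | re.MULTILINE)

    def has_review(commit_message):
        return REVIEW_TAG.search(commit_message) is not None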

The scripts also keep track of how pull requests percolate up the hierarchy, which allows filtering on a per-subsystem level. Commits in topic branches are accounted to the subsystem through which they first land in Linus’ tree. That’s fairly arbitrary, but the simplest to implement.

Last few years of GPU subsystem history

Since I’ve pitched the GPU subsystem against the kernel at large in my recent talks, let’s first look at what things look like in graphics:

GPU maintainer commit statistics Fig. 1 GPU total commits, maintainer self-commits and reviewed maintainer self-commits
GPU relative maintainer commit statistics Fig. 2 GPU percentage maintainer self-commits and reviewed maintainer self-commits

In absolute numbers it’s clear that graphics has grown tremendously over the past few years, much faster than the kernel at large. Depending upon the metric you pick, the GPU subsystem has grown from being 3% of the kernel to about 10%, and now trades spots for 2nd largest subsystem with arm-soc and staging (depending on who’s got a big pull for that release).

Maintainer commits keep up with GPU subsystem growth

The relative numbers have a different story. First, commit rights and the fairly big roll out of group maintainership we’ve done in the past 2 years aren’t extreme by historical graphics subsystem standards. We’ve always had around 30-40% maintainer self-commits. There’s a bit of a downward trend in the years leading towards v4.4, due to the massive growth of the i915 driver, and our failure to add more maintainers and committers for a few releases. Adding lots more committers and creating bigger maintainer groups from v4.5 on forward, first for the i915 driver, then to cope with the influx of new small drivers, brought us back to the historical trend line.

There’s another dip happening in the last few kernels, due to AMD bringing in a big new team of contributors to upstream. The dip was even more pronounced in v4.15, the release in which the entirely rewritten DC display driver for AMD GPUs landed. The AMD team is already using a committer model for their staging and internal trees, but not (yet) committing directly to their upstream branch. There are a few process holdups, mostly around the CI flow, that need to be fixed first. As soon as that’s done I expect this recent dip will be over again.

In short, even when facing big growth like the GPU subsystem has, it’s very much doable to keep training new maintainers to keep up with the increased demand.

Review of maintainer self-commits established in the GPU subsystem

Looking at relative changes in how consistently maintainer self-commits are reviewed, there’s a clear growth from mostly no review to 80+% of all maintainer self-commits having seen some formal oversight. We didn’t just keep up with the growth, but scaled faster and managed to make review a standard practice. Most of the drivers, and all the core code, are now consistently reviewed. Even for tiny drivers with small to single person teams we’ve managed to pull this off, through combining them into larger teams run with a group maintainership model.

Last few years of kernel w/o GPU history

kernel w/o GPU maintainer commit statistics Fig. 3 kernel w/o GPU maintainer self-commits and reviewed maintainer self-commits
kernel w/o GPU relative maintainer commit statistics Fig. 4 kernel w/o GPU percentage maintainer self-commits and reviewed maintainer self-commits

Kernel w/o graphics is an entirely different story. Overall, review happens much less, with only about 30% of all maintainer self-commits carrying any indication of oversight. The low ratio of maintainer self-commits is why I removed the total commit number from the absolute graph - it would have dwarfed the much more interesting data on self-commits and reviewed self-commits. The positive thing is that there’s at least a consistent, if very small, upward trend in maintainer self-commit reviews, both in absolute and relative numbers. But it’s very slow, and it will likely take decades until there’s no longer a double standard on review between contributors and maintainers.

Maintainers are not keeping up with the kernel growth overall

Much more worrying is the trend on maintainer self-commits. Both in absolute, and much more in relative numbers, there’s a clear downward trend, going from around 25% to below 15%. This indicates that the kernel community fails to mentor and train new maintainers at a pace sufficient to keep up with growth. Current maintainers are ever more overloaded, leaving ever less time for them to write patches of their own and get them merged.

Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged while contributing nothing of their own. The likely outcome of that is the kernel community imploding under its own bureaucratic weight.

This is a huge contrast to the “everything is getting better, bigger, and the kernel community is very healthy” fanfare touted at keynotes and in the yearly kernel report. In my opinion, the kernel community very much does not look like it is coping well with its growth, nor like an overall healthy community - even when ignoring all the issues around conduct that I’ve raised.

It is also a huge contrast to what we’ve experienced in the GPU subsystem since aggressively rolling out group maintainership starting with the v4.5 release; by spreading the bureaucratic side of applying patches over many more people, maintainers have much more time to create their own patches and get them merged. More crucially, experienced maintainers can focus their limited review bandwidth on the big architectural design questions since they won’t get bogged down in the minutiae of every single simple patch.

4.16 by subsystem

Let’s zoom into how this all looks at a subsystem level, looking at just the recently released 4.16 kernel.

Most subsystems have unsustainable maintainer ratios

Trying to come up with a reasonable list of subsystems that have high maintainer commit ratios is tricky; some rather substantial pull requests are essentially just maintainers submitting their own work, giving them an easy 100% score. But of course that’s just an outlier in the larger scope of the kernel overall having a maintainer self-commit ratio of just 15%. To get a more interesting list of subsystems we need to look only at those with a group of regular contributors and more than just one maintainer. A fairly arbitrary cut-off of 200 commits or more in total seems to get us there, yielding the following top ten list:

subsystem            total commits   maintainer self-commits   maintainer ratio
GPU                  1683            614                       36%
KVM                  257             91                        35%
arm-soc              885             259                       29%
linux-media          422             111                       26%
tip (x86, core, …)   792             125                       16%
linux-pm             201             31                        15%
staging              650             61                        9%
linux-block          249             20                        8%
sound                351             26                        7%
powerpc              235             16                        7%

In short, there are very few places where it’s easier to get your own patches merged as a maintainer than the already rather low roughly 15% the kernel scores overall. Outside of these few subsystems, the only realistic way is to create a new subsystem, somehow get it merged, and become its maintainer. In most subsystems being a maintainer is an elite status, and the historical trends suggest it will only become more so. If this trend isn’t reversed, maintainer overload will get a lot worse in the coming years.

Of course subsystem maintainers are expected to spend more time reviewing and managing other people’s contributions. When looking at individual maintainers it would be natural to expect a slow decline in their own contributions in patch form, and hence a decline in self-commits. But below them a new set of maintainers should grow and receive mentoring, and those more junior maintainers would focus more on their own work. That sustainable maintainer pipeline seems to be absent in many kernel subsystems, painting a bleak future for them.

Much more interesting are the review statistics, split up by subsystem. Again we need a cut-off for noise and outliers. The big outliers here are all the pull requests and trees that have seen zero review, not even any Acked-by tags. As long as we only look at positive examples we don’t need to worry about those. A rather low cut-off of at least 10 maintainer self-commits takes care of other random noise:

subsystem     total commits   maintainer self-commits   maintainer review ratio
f2fs          72              12                        100%
XFS           105             78                        100%
arm64         166             23                        91%
GPU           1683            614                       83%
linux-mtd     99              12                        75%
KVM           257             91                        74%
linux-pm      201             31                        71%
pci           145             37                        65%
remoteproc    19              14                        64%
clk           139             14                        64%
dma-mapping   63              60                        60%

Yes, XFS and f2fs have their shit together. More interesting is how wide the spread in the filesystem code is; there’s a bunch of substantial fs pulls with a review ratio of flat-out zero. Not even a single Acked-by. XFS on the other hand insists on full formal review of everything - I spot-checked the history a bit. f2fs is a bit of an outlier with 4.16, barely getting above the cut-off; usually it has fewer patches and would have been excluded.

Everyone not in the top ten taken together has a review ratio of 27%.

Review double standards in many big subsystems

Looking at the big subsystems with multiple maintainers and huge groups of contributors - I picked 500 patches as the cut-off - there’s some really low review ratios: Staging has 7%, networking 9% and tip scores 10%. Only arm-soc is close to the top ten, with 50%, at the 14th position.

Staging having no standard is kinda the point, but the other core subsystems eschewing review is rather worrisome. More than 9 out of 10 maintainer self-commits merged into these core subsystems do not carry any indication that anyone else ever looked at the patch and deemed it a good idea. The only other subsystem with more than 500 commits is the GPU subsystem, at 4th position with an 83% review ratio.

Compared to the maintainer situation overall, the review situation looks a lot less bleak. There’s a sizeable group of subsystems that at least try to make this work, by applying similar review criteria to maintainer self-commits as to normal contributions. This is also supported by the rather slow but steady overall increase of reviews in the historical trend.

But there’s clearly other subsystems where review only seems to be a gauntlet inflicted on normal contributors, entirely optional for maintainers themselves. Contributors cannot avoid review, because they can’t commit their own patches. When maintainers outright ignore review for most of their patches this creates a clear double standard between maintainers and mere contributors.

One year ago I wrote “Review, not Rocket Science” on how to roll out review in your subsystem. Looking at this data here I can close with an even shorter version:

What would Dave Chinner do?

Thanks a lot to Daniel Stone, Dave Chinner, Eric Anholt, Geoffrey Huntley, Luce Carter and Sean Paul for reading and commenting on drafts of this article.

April 22, 2018


If you're curious about the slides, you can download the PDF or the OTP.


This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of FossNorth, specifically @e8johan for hosting a great event.



Mmm, a Moving Mesa Midgard Cube


In the last Panfrost status update, a transitory “half-way” driver was presented, with the purpose of easing the transition from a standalone library abstracting the hardware to a full-fledged OpenGL ES driver using the Mesa and Gallium3D infrastructure.

Since then, I’ve completed the transition, creating such a driver, but retaining support for out-of-tree testing.

Almost everything that was exposed with the custom half-way interface is now available through Gallium3D. Attributes, varyings, and uniforms all work. A bit of rasterisation state is supported. Multiframe programs work, as do programs with multiple non-indexed, direct draws per frame.

The result? The GLES test-cube demo from Freedreno runs using the Mali T760 GPU present in my RK3288 laptop, going through the Mesa/Gallium3D stack. Of course, there’s no need to rely on the vendor’s proprietary compilers for shaders – the demo is using shaders from the free, NIR-based Midgard compiler.

Look ma, no blobs!

In the past three weeks since the previous update, all aspects of the project have seen fervent progress, culminating in the above demo. The change list for the core Gallium driver is lengthy but largely routine: abstracting features of the hardware which were already understood and integrating them with Gallium, resolving bugs discovered in the process, and repeating until the next GLES test passes. Enthusiastic readers can read the code of the driver core on GitLab.

Although numerous bugs were solved in this process, one in particular is worthy of mention: the “tile flicker bug”, notorious to lurkers of our Freenode IRC channel, #panfrost. Present since the first render, this bug resulted in non-deterministic rendering glitches, where particular tiles would display the background colour in lieu of the render itself. The non-deterministic nature had long suggested it was either the result of improper memory management or a race condition, but the precise cause was unknown. Finally, the cause was narrowed down to a race condition between the vertex/tiler jobs responsible for draws, and the fragment job responsible for screen painting. With this cause in mind, a simple fix squashed the bug, hopefully for good; renders are now deterministic and correct. Huge thanks to Rob Clark for letting me use him as a sounding board to solve this.

In terms of decoding the command stream, some miscellaneous GL state has been determined, like some details about tiler memory management, texture descriptors, and shader linkage (attribute and varying metadata). By far, however, the most significant discovery was the operation of blending on Midgard. It’s… well, unique. If I had known how nuanced the encoding was – and how much code it takes to generate from Gallium blend state – I would have postponed decoding like originally planned.

In any event, blending is now understood. Under Midgard, there are two paths in the hardware for blending: the fixed-function fast path, and the programmable slow path, using “blend shaders”. This distinction has been discussed sparsely in Mali documentation, but the conditions for the fast path were not known until now. Without further ado, the fixed-function blending hardware works when:

  • The blend equation is either ADD, SUBTRACT, or REVERSE_SUBTRACT (but not MIN or MAX)
  • The “dominant” blend function is either the source/destination colour/alpha, or the special case of a constant ONE or ZERO (but not a constant colour or anything fancier), or the additive complement thereof.
  • The non-dominant blend function is either identical to the dominant blend function, or one of the constant special cases.

If these conditions are not met, a blend shader is used instead, incurring a presently unknown performance hit.

By dominant and non-dominant modes, I’m essentially referring to the more complex and less complex blend functions respectively, comparing the functions for the source and the destination. The exact details of the encoding are a little hairy and beyond the scope of this post, but they are included in the corresponding Panfrost headers and the corresponding code in the driver.
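Putting the conditions together, here is a small sketch of the fast-path check (in Python, with equation and function names paraphrased from this post rather than taken from the actual driver):

    # Blend equations the fixed-function path supports.
    FIXED_FUNCTION_EQUATIONS = {"ADD", "SUBTRACT", "REVERSE_SUBTRACT"}

    # Allowed "dominant" factors: source/destination colour/alpha,
    # constant ONE or ZERO, and their additive complements.
    DOMINANT_OK = {
        "SRC_COLOR", "DST_COLOR", "SRC_ALPHA", "DST_ALPHA", "ONE", "ZERO",
        "ONE_MINUS_SRC_COLOR", "ONE_MINUS_DST_COLOR",
        "ONE_MINUS_SRC_ALPHA", "ONE_MINUS_DST_ALPHA",
    }
    CONSTANT_SPECIAL = {"ONE", "ZERO"}

    def can_use_fixed_function(equation, dominant, non_dominant):
        # If this returns False, the hardware needs a blend shader.
        return (equation in FIXED_FUNCTION_EQUATIONS
                and dominant in DOMINANT_OK
                and (non_dominant == dominant
                     or non_dominant in CONSTANT_SPECIAL))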

In any event, this separation between fixed-function and programmable blending is now more or less understood. Additionally, blend shaders themselves are now intelligible with Connor Abbott’s Midgard disassembler; blend shaders are just normal Midgard shaders, with an identical ISA to vertex and fragment shaders, and will eventually be generated with the existing NIR compiler. With luck, we should be able to reuse code from the NIR compiler for the vc4, an embedded GPU lacking fixed-function hardware for any blending whatsoever. Additionally, blend shaders open up some interesting possibilities; we may be able to enable developers to write blend shaders themselves in GLSL through a vendored GL extension. More practically, blend shaders should enable implementation of all blend modes, as this is ES 3.2 class hardware, as well as presumably logic operations.

Command-stream work aside, the Midgard compiler also saw some miscellaneous improvements. In particular, the mystery surrounding varyings in vertex shaders has finally been cracked. Recall that gl_Position stores are accomplished by writing the screen-space coordinate to the special register r27, and then including a st_vary instruction with the mysterious input register r1 to the appropriate address. At the time, I had (erroneously) assumed that the r27 store was responsible for the write, and the subsequent instruction was a peculiar errata workaround.

New findings show it is quite the opposite: it is the store instruction that does the store, but it uses the value of r27, not r1, for its input. What does r1 signify, then? It turns out that two different registers can be used for varying writes, r26 and r27. The register in the store instruction selects between them: a value of zero uses r26 whereas a value of one uses r27. Why, then, are there two varying source registers? Midgard is a VLIW architecture, in this case meaning that it can execute two store instructions simultaneously for improved performance. To achieve this parallelism, it needs two source registers, to be able to write two different values to two different varyings.

This new understanding clarifies some previously peculiar disassemblies, as the purpose of writes to r26 is now understood. This discovery would have been easier had r26 not also represented a reference to an embedded constant!

More importantly, it enables us to implement varying stores in the vertex shader, allowing smooth-shaded demos, like the shading on test-cube, to work. As a bonus, it cleans up the code relating to gl_Position writes, as we now know they can use the same compiler code path as writes to normal varyings.

Besides varyings, the Midgard compiler also saw various improvements, notably including a basic register allocator, crucial for compiling even slightly nontrivial shaders, such as that of the cube.

Beyond Midgard, my personal focus, Bifrost has continued to see sustained progress. Connor Abbott has continued decoding the new shader ISA, uncovering and adding disassembler support for a few miscellaneous new instructions, and in particular branching. Branching under Bifrost is somewhat involved - the relevant disassembler commit added over two hundred lines of code - with semantics differing noticeably from Midgard. He has also begun porting the panwrap infrastructure for capturing, decoding, and replaying command streams from Midgard to Bifrost, to pave the way for a full port of the driver to Bifrost down the line.

While Connor continues work on his disassembler, Lyude Paul has been working on a Bifrost assembler compatible with the disassembler’s output, a milestone necessary to demonstrate understanding of the instruction set and a useful prerequisite to writing a Bifrost compiler.

Going forward, I plan on cleaning up technical debt accumulated in the driver to improve maintainability, flexibility, and perhaps performance. Additionally, it is perhaps finally time to address the elephant in the command stream room: textures. Prior to this post, there were two major bugs in the driver: the missing tile bug and the texture reading bug. Seeing as the former was finally solved with a bit of persistence, there’s hope for the latter as well.

May the pans frost on.

April 21, 2018

In February after Plasma 5.12 was released we held a meeting on how we want to improve Wayland support in Plasma 5.13. Since its beta is now less than one month away it is time for a status report on what has been achieved and what we still plan to work on.

Also, today started a week-long Plasma Sprint in Berlin, which will hopefully accelerate the Wayland work for 5.13. So in order to kick-start the sprint, this is a good opportunity to sum up where we stand now.


Let us start with a small change, but with huge implications: the decision to not set the environment variable QT_QPA_PLATFORM to wayland anymore in Plasma’s startup script.

Qt based applications use this environment variable to determine the platform plugin they should load. The environment variable was set to wayland in Plasma’s Wayland session in order to tell Qt based applications that they should act like Wayland native clients. Otherwise they load the default plugin, which is xcb, meaning that they try to be X clients in a Wayland session.

This also works, thanks to Xwayland, but of course in a Wayland session we want as many applications as possible to be Wayland native clients. That was probably the rationale behind setting the environment variable in the first place. The problem though is that this is not always possible. While KDE applications are compiled with the Qt Wayland platform plugin, some third-party Qt applications are not. A prominent example is the Telegram desktop client, which would just give up on launch in a Wayland session because of that.

With the change this is no longer a problem. No longer forced by the QT_QPA_PLATFORM environment variable to load an unavailable plugin, the Telegram binary will just execute with the xcb plugin and therefore run as an Xwayland client in our Wayland session.

One drawback is that this now applies to all Qt based applications. While the Plasma processes were adjusted to select the Wayland plugin themselves based on session information, other applications might not do this, and will then still run as Xwayland clients even though the Wayland plugin is available. But this problem might go away with Qt 5.11, which is supposed to either change the behavior of QT_QPA_PLATFORM itself or feature a new environment variable, such that an application can express plugin preferences and fall back to the first one supported by the session.

Martin Flöser, who wrote most of the patches for this change, talked about it and the consequences in his blog as well.


A huge topic for desktop Wayland was screen recording and sharing. In the past, application developers had a single point of entry to write for in order to receive screencasts: the XServer. In Wayland the compositor, as the Wayland server, has replaced the XServer, and so an application needs to talk to the compositor if it wants access to screen content.

This rightfully raised the fear that now developers of screencast apps would need to write for every other Wayland compositor a different backend to receive video data. As a spoiler: luckily this won’t be necessary.

So how did we achieve this? First of all, support for screencasts had to be added to KWin and KWayland. This was done by Oleg Chernovskiy. While this is still a KWayland specific interface, the trick was to proxy it via xdg-desktop-portal and PipeWire. Jan Grulich jumped in and implemented the necessary backend code on the xdg-desktop-portal side.

A screencast app therefore in the future only needs to talk to xdg-desktop-portal and receive video data through PipeWire on Plasma Wayland. Other compositors will then have to add a similar backend to xdg-desktop-portal as Jan did, but the screencast app stays the same.

Configure your mouse

I wrote a system settings module (KCM) for touchpad configuration on Wayland last year. The touchpad KCM had higher priority than the Mouse KCM back then because there was no way to configure anything about a touchpad on Wayland, while there was a small hack in KWin to at least control the mouse speed.

Still, this was no long-term solution with regard to the Mouse KCM, and so I wrote a libinput based Wayland Mouse KCM similar to the one I wrote for touchpads.

Wayland Mouse KCM

I went one step further and made the Mouse KCM interact with libinput on X as well. There was some work on this in the Mouse KCM done in the past, but now it features a fitting UI like on Wayland and uses the same backend abstraction.

Dmabuf-based Wayland buffers

Fredrik Höglund uploaded patches for review to add support for dmabuf-based Wayland buffer sharing. This is a somewhat technical topic and will not directly influence the user experience in 5.13. But it is to see in the context of bigger changes upstream in Wayland, X and Mesa. The keyword here is buffer modifiers. You can read more about them in this article by Daniel Stone.

Per output color correction

Adjusting the colors and overall gamma of displays individually is a feature, which is quite important to some people and is provided in a Plasma X session via KGamma in a somewhat simplistic fashion.

Since I wrote Night Color as a replacement for Redshift in our Wayland session not long ago I was already somewhat involved in the color correction game.

But this game is becoming increasingly more complex: my current solution for per output color correction includes changes to KWayland, KWin, libkscreen, libcolorcorrect and adds a KCM replacing KGamma on Wayland to let the user control it.

Additionally, there are different opinions on how this should work in general, and some explanations by upstream confused me more than they guided me to the one best solution. I will most likely ignore these opinions for the moment and concentrate on the one solution I have right now, which might already be sufficient for most people. I believe it will actually be quite nice to use; for example, I plan to provide a color curve widget borrowed from Krita to set the color curves via some control points and curve interpolation.

More on 5.13 and beyond

In the context of per output color correction, another topic I am working on right now is abstracting our output classes in KWin’s DRM and Virtual backends to the compositing level. This will first enable my color correction code to be nicely integrated, and I anticipate it will in the long term even be necessary for two other far more important topics: layered rendering and compositing per output, which will improve performance and allow different refresh rates on multi-monitor setups. But these two tasks will need much more time.

Scaling on Wayland can be done per output, and while I am no expert on this topic, from what I heard scaling should work much better on Wayland than on X because of that and for other reasons. But there is currently one huge drawback in our Wayland session: we can only scale by integer factors. To change this, David Edmundson has posted patches for review adding support for xdg-output to KWayland and to KWin. This is one step towards allowing fractional scaling on Wayland. There is more to do according to David, and since he takes part in the sprint I hope we can talk about scaling on Wayland extensively, in order for me to better understand the current mechanism and what needs to be changed to provide fractional scaling.

Lastly, there is cursor locking, which is in theory supported by KWin, but in practice does not work well in the games I tried it with. I hope to start work on this topic before 5.13, but I will most likely not finish it in time for 5.13.

So overall there is lots of progress, but still quite some work to do. In this regard I am certain the Plasma Sprint this week will be fruitful. We can discuss problems, exchange knowledge and simply code in unity (no pun intended). If you have questions or feedback that you want us to address at this sprint, feel free to comment on this article.

April 20, 2018

On Tuesday April 17 we released the first batch of Solaris 10 patches & patchsets under Solaris 10 Extended Support.  There were a total of 24 Solaris 10 patches, including kernel updates, and 4 patchsets released on MOS!

Solaris 10 Extended Support will run through January 2021. Scott Lynn put together a very informative blog post on Solaris 10 Extended Support detailing the benefits that customers can get by purchasing Extended Support for Solaris 10.

Those of you that have taken advantage of our previous Extended Support offerings for Solaris 8 and Solaris 9 will notice that we've changed things around a little with Solaris 10 Extended Support; previously we did not publish any updates to the Solaris 10 Recommended Patchsets during the Extended Support period.  This meant that the Recommended Patchsets remained available to all customers with Premier Operating Systems support, as all the patches the patchsets contained had Operating Systems entitlement requirements.

Moving forward with Solaris 10 Extended Support, the decision has been made to continue to update the Recommended Patchsets through the Solaris 10 Extended Support period. This means customers that purchase Solaris 10 Extended Support get the benefit of continued Recommended Patchset updates, as patches that meet the criteria for inclusion in the patchsets are released. During the Solaris 10 Extended Support period, the updates to the Recommended Patchsets will contain patches that require a Solaris 10 Extended Support contract, so the Solaris 10 Recommended Patchsets will also require a Solaris 10 Extended Support contract during this period.

For customers that do not wish to avail themselves of Extended Support and would like to access the last Recommended Patchsets created prior to the beginning of Extended Support for Solaris 10, the January 2018 Critical Patch Updates (CPUs) for Solaris 10 will remain available to those with Premier Operating System Support.

The CPU Patchsets are rebranded versions of the Recommended Patchset on the CPU dates; the patches included in the CPUs are identical to the Recommended Patchset released on those CPU dates, but the CPU READMEs will be updated to reflect their use as CPU resources. CPU patchsets are archived and are always available via MOS at later dates so that customers can easily align to their desired CPU baseline at any time. A further benefit that only Solaris 10 Extended Support customers will receive is access to newly created CPU Patchsets for Solaris 10 through the Extended Support period.

The following table provides a quick reference to the recent Solaris 10 patchsets that have been released, including details of the support contract required to access them:

Patchset Name                                  Support Contract Required
Recommended OS Patchset for Solaris 10 SPARC   Extended Support
Recommended OS Patchset for Solaris 10 x86     Extended Support
CPU OS Patchset 2018/04 Solaris 10 SPARC       Extended Support
CPU OS Patchset 2018/04 Solaris 10 x86         Extended Support
CPU OS Patchset 2018/01 Solaris 10 SPARC       Operating Systems Support
CPU OS Patchset 2018/01 Solaris 10 x86         Operating Systems Support

(Patchset details, READMEs, and downloads for each patchset are available via MOS.)

Please reach out to your local sales representative if you wish to get more information on the benefits of purchasing Extended Support for Solaris 10.
April 18, 2018

At long last, we provide the ability to remove a top-level VDEV from a ZFS storage pool in the upcoming Solaris 11.4 Beta refresh release.

For many years, our recommendation was to create a pool based on current capacity requirements and then grow the pool to meet increasing capacity needs by adding VDEVs or by replacing smaller LUNs with larger LUNs. It is trivial to add capacity or replace smaller LUNs with larger LUNs, sometimes with just one simple command.

The simplicity of ZFS is one of its great strengths!

I still recommend the practice of creating a pool that meets current capacity requirements and then adding capacity when needed. If you need to repurpose pool devices in an over-provisioned pool or if you accidentally misconfigure a pool device, you now have the flexibility to resolve these scenarios.

Review the following practical considerations when using this new feature, which should be used as an exception rather than the rule for pool configuration on production systems:

  • A virtual (pseudo) device is created to move the data off the removed pool devices, so the pool must have enough space to absorb the creation of the pseudo device
  • Only top-level VDEVs can be removed from mirrored or RAIDZ pools
  • Individual devices can be removed from striped pools
  • Pool device misconfigurations can be corrected

A few implementation details in case you were wondering:

  • No additional steps are needed to remap the removed devices
  • Data from the removed devices are allocated to the remaining devices but this is not a way to rebalance all data on pool devices
  • Reads of the reallocated data are done from the pseudo device until those blocks are freed
  • Some levels of indirection are needed to support this operation but they should not impact performance nor increase memory requirements

See the examples below.

Repurpose Pool Devices

The following pool, tank, has low space consumption so one VDEV is removed.

# zpool list tank
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
tank   928G  28.1G  900G   3%  1.00x  ONLINE  -

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0

errors: No known data errors

# zpool remove tank mirror-1

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices are being removed.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Sun Apr 15 20:58:45 2018
        28.1G scanned
        3.07G resilvered at 40.9M/s, 21.83% done, 4m35s to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
          mirror-1  REMOVING     0     0     0
            c1t7d0  REMOVING     0     0     0
            c5t3d0  REMOVING     0     0     0

errors: No known data errors

Run the zpool iostat command to verify that data is being written to the remaining VDEV.

# zpool iostat -v tank 5
                          capacity     operations    bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
tank                   28.1G   900G      9    182   932K  21.3M
  mirror-0             14.1G   450G      1    182  7.90K  21.3M
    c3t2d0                 -      -      0     28  4.79K  21.3M
    c4t2d0                 -      -      0     28  3.92K  21.3M
  mirror-1                 -      -      8    179   924K  21.2M
    c1t7d0                 -      -      1     28   495K  21.2M
    c5t3d0                 -      -      1     28   431K  21.2M
---------------------  -----  -----  -----  -----  -----  -----

                          capacity     operations    bandwidth
pool                   alloc   free   read  write   read  write
---------------------  -----  -----  -----  -----  -----  -----
tank                   28.1G   900G      0    967      0  60.0M
  mirror-0             14.1G   450G      0    967      0  60.0M
    c3t2d0                 -      -      0     67      0  60.0M
    c4t2d0                 -      -      0     68      0  60.4M
  mirror-1                 -      -      0      0      0      0
    c1t7d0                 -      -      0      0      0      0
    c5t3d0                 -      -      0      0      0      0
---------------------  -----  -----  -----  -----  -----  -----

Misconfigured Pool Device

In this case, a device was intended to be added as a cache device but was instead added as a regular single top-level device. The problem is identified and resolved.

# zpool status rzpool
  pool: rzpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0

errors: No known data errors

# zpool add rzpool c3t3d0
vdev verification failed: use -f to override the following errors:
mismatched replication level: pool uses raidz and new vdev is disk
Unable to build pool from specified devices: invalid vdev configuration

# zpool add -f rzpool c3t3d0

# zpool status rzpool
  pool: rzpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
          c3t3d0    ONLINE       0     0     0

errors: No known data errors

# zpool remove rzpool c3t3d0
# zpool add rzpool cache c3t3d0

# zpool status rzpool
  pool: rzpool
 state: ONLINE
  scan: resilvered 0 in 1s with 0 errors on Sun Apr 15 21:09:35 2018
config:

        NAME        STATE     READ WRITE CKSUM
        rzpool      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t7d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
        cache
          c3t3d0    ONLINE       0     0     0

In summary, Solaris 11.4 includes a handy new option for repurposing pool devices and resolving pool misconfiguration errors.

April 17, 2018

On January 30, 2018, we released the Oracle Solaris 11.4 Open Beta. It has been quite successful.

Today, we are announcing that we've refreshed the 11.4 Open Beta. This refresh includes new capabilities and additional bug fixes (over 280 of them) as we drive to the General Availability Release of Oracle Solaris 11.4.

Some new features in this release are:

  • ZFS Device Removal
  • ZFS Scheduled Scrub
  • SMB 3.1.1
  • Oracle Solaris Cluster Compliance checking
  • ssh-ldap-getpubkey

Also, the Oracle Solaris 11.4 Beta refresh includes the changes to mitigate CVE-2017-5753, otherwise known as Spectre Variant 1, for Firefox, the NVIDIA Graphics driver, and the Solaris Kernel (see MOS docs on SPARC and x86 for more information).

Additionally, new bundled software includes gcc 7.3, libidn2, and qpdf 7.0.0, along with more than 45 new bundled software versions.

Before I go further, I have to say:

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.  The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle Corporation.

I want to take a few minutes to address some questions I've been getting that the upcoming release of Oracle Solaris 11.4 has sparked. 

Oracle Solaris 11.4 runs on Oracle SPARC and x86 systems released since 2011, but not on certain older systems that had been supported in Solaris 11.3 and earlier.  Specifically, systems not supported in Oracle Solaris 11.4 include systems based on the SPARC T1, T2, and T3 processors or the SPARC64 VII+ and earlier based “Sun4u” systems such as the SPARC Enterprise M4000.  To allow customers time to migrate to newer hardware we intend to provide critical security fixes as necessary on top of the last SRU delivered for 11.3 for the following year.  These updates will not provide the same level of content as regular SRUs  and are intended solely as a transition vehicle.  Customers using newer hardware are encouraged to update to Oracle Solaris 11.4 and subsequent Oracle Solaris 11 SRUs as soon as practical.

Another question I've been getting quite a bit is about the release frequency and strategy for Oracle Solaris 11.

After much discussion internally and externally, with you, our customers, about our current continuous delivery release strategy, we are going forward with our current strategy with some minor changes:

  • Oracle Solaris 11 update releases will be released every year in approximately the first quarter of our fiscal year (that's June, July, August for most people).
  • New features will be made available as they are ready to ship in whatever is the next available and appropriate delivery vehicle. This could be an SRU, CPU or a new release.
  • Oracle Solaris 11 update releases will contain the following content:
    • All new features previously released in the SRUs between the releases
    • Any new features that are ready at the time of release
    • Free and Open Source Software updates (i.e. new versions of FOSS)
    • End of Features and End of Life hardware.

This should make our releases more predictable, maintain the reliability you've come to depend on, and provide new features to you rapidly, allowing you to test them and deploy them faster.

Oracle Solaris 11.4 is secure, simple, and cloud-ready, and it is compatible with all your existing Oracle Solaris 11.3 and earlier applications.

Go give the latest beta a try. You can download it here.


We've just released Oracle Solaris 11.3 SRU 31. This is the April Critical Patch Update and contains some important security fixes as well as enhancements to Oracle Solaris. SRU 31 is now available from My Oracle Support Doc ID 2045311.1, or via 'pkg update' from the support repository.

The following components have been updated to address security issues:

These enhancements have also been added:

Full details of this SRU can be found in My Oracle Support Doc 2385753.1.
For the list of Service Alerts affecting each Oracle Solaris 11.3 SRU, see Important Oracle Solaris 11.3 SRU Issues (Doc ID 2076753.1).

For some time now I have been working on a personal project to render the well-known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:

Sponza rendering

This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).

The following list includes the main features implemented in the demo:

  • Depth pre-pass
  • Forward and deferred rendering paths
  • Anisotropic filtering
  • Shadow mapping with Percentage-Closer Filtering
  • Bump mapping
  • Screen Space Ambient Occlusion (only on the deferred path)
  • Screen Space Reflections (only on the deferred path)
  • Tone mapping
  • Anti-aliasing (FXAA)

I have been thinking about writing a post about this for some time, but given that there are multiple features involved I wasn’t sure how to scope it. Eventually I decided to write a “frame analysis” post where I describe, step by step, all the render passes involved in the production of the single frame capture shown at the top of the post. I have always enjoyed reading this kind of article, so I figured it would be fun to write one myself, and I hope others find it informative, if not entertaining.

To avoid making the post too dense I won’t go into too much detail while describing each render pass, so don’t expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlusion, for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specific posts (I don’t make any promises though!).

If you’re interested in going through with this then grab a cup of coffee and get ready, it is going to be a long ride!

Step 0: Culling

This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn’t affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.

In large, complex scenes with tons of objects we would probably want to use more sophisticated culling methods such as quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go through all of them and cull them individually against the camera’s frustum, which determines the area of the 3D space that is visible to the camera.

The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera’s frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.

The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but this should not affect many meshes, and trying to improve it would incur additional checks that could undermine the efficiency of the process anyway.
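For the curious, a minimal sketch of the test described above (plain Python rather than the demo’s actual Vulkan-side code), assuming the camera’s six frustum planes are available as (nx, ny, nz, d) tuples with the normals pointing into the frustum:

    def aabb_outside_frustum(box_min, box_max, frustum_planes):
        for nx, ny, nz, d in frustum_planes:
            # Pick the box corner furthest along the plane normal
            # (the so-called p-vertex).
            px = box_max[0] if nx >= 0 else box_min[0]
            py = box_max[1] if ny >= 0 else box_min[1]
            pz = box_max[2] if nz >= 0 else box_min[2]
            # If even that corner is behind the plane, the whole box
            # is outside the frustum and the mesh can be culled.
            if nx * px + ny * py + nz * pz + d < 0:
                return True
        return False  # potentially visible (conservative)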

Since in this particular demo we only have static geometry we only need to run the culling pass when the camera moves around, since otherwise the list of visible meshes doesn’t change. If dynamic geometry were present, we would need to at least cull dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step in (or out of) the viewing frustum at any moment.

Step 1: Depth pre-pass

This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, and very specially so as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up in the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.

One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won’t eliminate all the instances of the problem in the general case.

Also, sometimes things are more complicated, as the shading cost of different pieces of geometry can be very different and we should take this into account as well. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while others are very far away, and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so that by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.

Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don’t even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don’t hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.

Depth pre-pass output

The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That’s why the picture gets brighter as we move further away.

Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible in the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth-prepass.

In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance output.

Step 2: Shadow map

In this demo we have a single light source: a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction of the projected shadows.

I already covered how shadow mapping works in a previous series of posts, so if you’re interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).

With that information, we will be able to inform our lighting pass so it can tell whether a particular fragment is in the shadows (not visible from our light’s perspective) or in the light (visible from our light’s perspective) and shade it accordingly.

From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera’s and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.
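To illustrate, here is a hedged sketch of the orthographic projection involved (plain Python with numpy; the demo’s actual code is Vulkan/GLSL and its clip-space conventions differ slightly). The left/right/bottom/top/near/far bounds would be chosen so that the resulting box encloses all relevant shadow casters around the camera:

    import numpy as np

    def ortho(left, right, bottom, top, near, far):
        # Classic OpenGL-style orthographic projection matrix; Vulkan
        # would additionally flip Y and remap depth to [0, 1].
        m = np.identity(4)
        m[0, 0] = 2.0 / (right - left)
        m[1, 1] = 2.0 / (top - bottom)
        m[2, 2] = -2.0 / (far - near)
        m[0, 3] = -(right + left) / (right - left)
        m[1, 3] = -(top + bottom) / (top - bottom)
        m[2, 3] = -(far + near) / (far - near)
        return m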

Shadow map

In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends with the maximum depth, which is what we use to clear the shadow map before rendering to it.

In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):

Sponza rendering with 1024×1024 shadow map

You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.

Step 3: GBuffer

In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).

What we do here is render our geometry normally, like we did in our depth-prepass, but this time, as explained before, we configure the depth test to only pass fragments that match the contents of the depth buffer that we produced in the depth-prepass, so we only process fragments that we know will be visible on the screen.

Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:

  1. Normal vector
  2. Diffuse color
  3. Specular color
  4. Position of the fragment from the point of view of the light (for shadow mapping)
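
A possible fragment shader interface for such a GBuffer is sketched below. The formats, packing and names are illustrative rather than the demo’s exact code (as discussed later in the SSR section, the demo also stores reflectiveness in the diffuse alpha channel):

layout(location = 0) in vec3 view_normal;      // interpolated from the vertex shader
layout(location = 1) in vec4 light_space_pos;

layout(location = 0) out vec4 out_normal;      // view-space normal
layout(location = 1) out vec4 out_diffuse;     // diffuse color (+ reflectiveness in alpha)
layout(location = 2) out vec4 out_specular;    // specular color / strength
layout(location = 3) out vec4 out_light_pos;   // fragment position in light space

void main()
{
   out_normal    = vec4(normalize(view_normal), 0.0);
   out_light_pos = light_space_pos;
   // out_diffuse / out_specular would be sampled from the material textures here
}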

It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don’t have dedicated video memory (such as my Intel GPU).

In the demo, I do lighting in view-space (that is, in a coordinate space that takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments on the screen. To avoid storing the latter in the GBuffer, we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth-prepass) and the camera’s projection matrix. I might cover the process in more detail in another post; for now, what is important to remember is that we don’t need to worry about storing fragment positions in the GBuffer, and that saves us some bandwidth, helping performance.
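For reference, the reconstruction usually boils down to an un-projection like the following GLSL sketch, assuming Vulkan’s [0, 1] depth range; the names are illustrative:

layout(set = 0, binding = 0) uniform sampler2D depth_tex;   // depth from the pre-pass

vec3 reconstruct_view_pos(vec2 uv, mat4 inv_proj)
{
   float depth = texture(depth_tex, uv).r;        // [0, 1] depth stored by the pre-pass
   vec4 ndc  = vec4(uv * 2.0 - 1.0, depth, 1.0);  // back to clip space
   vec4 view = inv_proj * ndc;                    // apply the inverse projection
   return view.xyz / view.w;                      // undo the perspective divide
}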

Let’s have a look at the various GBuffer textures we produce in this stage:

Normal vectors

GBuffer normal texture

Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera’s view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera’s view direction are blue (positive Z).

It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.
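In GLSL terms, applying a normal map essentially amounts to the following, assuming a per-fragment tangent space (TBN) basis is available; again, this is a sketch rather than the demo’s exact code:

layout(set = 0, binding = 1) uniform sampler2D normal_map;

layout(location = 0) in mat3 tbn;   // tangent, bitangent, normal (view space)
layout(location = 3) in vec2 uv;

vec3 bump_normal()
{
   vec3 n = texture(normal_map, uv).rgb * 2.0 - 1.0;   // remap [0, 1] -> [-1, 1]
   return normalize(tbn * n);                          // tangent space -> view space
}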

For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and you will immediately notice the additional detail added with bump mapping to the surface descriptions:

GBuffer normal texture (bump mapping disabled)

To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:

Bump mapping enabled
Bump mapping disabled

All the extra detail in the columns is the sole result of the bump mapping technique.

Diffuse color

GBuffer diffuse texture

Here we have the diffuse color of each fragment in the scene. This is basically how our scene would look if we didn’t implement a lighting pass that considers how the light source interacts with the scene.

Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.

Specular color

GBuffer specular texture

This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.

Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.

For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as is to be expected from textile materials. However, we also see that these same cloths have an embroidery that does have specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than the surrounding textile material:

Specular reflection on cloth embroidery

The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).

Fragment positions from Light

GBuffer light-space position texture

Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that we can tell, together with the shadow map, which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, there is more detail on how that process works, step by step and including Vulkan source code, in my series of posts on that topic.
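The test itself can be sketched as follows; the bias is a small illustrative value used to avoid self-shadowing artifacts (“shadow acne”), and the names are not the demo’s actual code:

layout(set = 0, binding = 2) uniform sampler2D shadow_map;

float shadow_factor(vec4 light_space_pos)
{
   vec3 p  = light_space_pos.xyz / light_space_pos.w;  // light-space NDC
   vec2 uv = p.xy * 0.5 + 0.5;                         // -> shadow map coordinates
   float nearest = texture(shadow_map, uv).r;          // depth of the closest occluder
   float bias = 0.005;                                 // avoid self-shadowing
   return (p.z - bias > nearest) ? 0.0 : 1.0;          // 0.0 = in shadow, 1.0 = lit
}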

Step 4: Screen Space Ambient Occlusion

With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.

The idea here is that, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence of this, lighting in areas that are not directly lit by a light source looks rather flat, as we can see in this image:

SSAO disabled

Screen-space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, especially in areas that are not directly lit:

SSAO enabled

Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images: without SSAO, we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.

To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):

SSAO output texture

In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.
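A heavily reduced version of the occlusion estimate could look like the sketch below. The kernel of random samples, the radius and the bias are all illustrative (and the real pass also blurs the result afterwards, as mentioned above); the view-space position texture is a hypothetical helper input:

layout(set = 0, binding = 3) uniform sampler2D view_pos_tex;   // hypothetical: view-space positions

layout(std140, set = 0, binding = 4) uniform KernelUBO {
   vec3 samples[24];   // random offsets around the fragment
} K;

float ambient_occlusion(vec3 frag_pos, mat4 proj, float radius, float bias)
{
   float occluded = 0.0;
   for (int i = 0; i < 24; ++i) {
      vec3 p = frag_pos + K.samples[i] * radius;    // a point near the fragment
      vec4 clip = proj * vec4(p, 1.0);
      vec2 uv = (clip.xy / clip.w) * 0.5 + 0.5;     // its position on the screen
      float scene_z = texture(view_pos_tex, uv).z;  // actual geometry depth there
      if (scene_z >= p.z + bias)                    // geometry in front of our sample
         occluded += 1.0;
   }
   return 1.0 - occluded / 24.0;                    // 1.0 = unoccluded
}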

Step 5: Lighting pass

Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.

The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.

Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and using it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.

We also use the SSAO output to improve the ambient light term as described before, multiplying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.
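Putting the pieces together, the core of the per-fragment computation amounts to something like this simplified sketch of the combination described above:

vec3 shade(vec3 ambient, vec3 diffuse, vec3 specular, float shadow, float ssao)
{
   vec3 ambient_term = ambient * ssao;                 // SSAO darkens occluded fragments
   vec3 direct_term  = (diffuse + specular) * shadow;  // removed entirely when in shadow
   return ambient_term + direct_term;
}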

The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.

After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:

Lighting pass output

Step 6: Tone mapping

After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, especially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).

To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.

The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.
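As an example of what such a mapping can look like, here is the classic Reinhard operator in GLSL; this is just one well-known curve, not necessarily the one used in the demo:

vec3 tone_map_reinhard(vec3 hdr)
{
   return hdr / (hdr + vec3(1.0));   // maps [0, inf) into [0, 1), compressing highlights
}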

In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since it would exceed the 1.0 per-color-component cap and lead to pure white colors as a result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone-mapping on the floor):

Tone mapping disabled
Tone mapping enabled

Step 7: Screen Space Reflections (SSR)

The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.

There are various ways to capture reflections, each with its own set of pros and cons. When I implemented my OpenGL terrain rendering demo, I implemented water reflections using “Planar Reflections”, which produce very accurate results at the expense of requiring us to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to set up (for example, you would need to run an additional culling pass), and you also need to consider that this has to be done for each planar surface you want to apply reflections on, so it doesn’t scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render the scene twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).

So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.

As with all screen-space techniques, it uses information already present on the screen to capture the reflection information, so we don’t have to render the scene again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, especially when there is dynamic geometry involved. Nevertheless, in my particular case I don’t have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won’t go into the details of the artifacts introduced by SSR here, but for those interested, here is a good discussion.

I should mention that my take on this is fairly basic and doesn’t implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.

The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:

Capturing reflections

I only implemented support for SSR in the deferred path, since like in the case of SSAO (and more generally all screen-space algorithms), deferred rendering is the best match since we are already capturing screen-space information in the GBuffer.

The first stage requires a means to identify the fragments that need reflection information; in our case, the floor fragments. What I did for this is capture the reflectiveness of the material of each fragment on the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.

One small note on the GBuffer storage: because this is a single floating-point value, we don’t necessarily need an extra attachment in the GBuffer (which would have some performance penalty), instead we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn’t support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).

Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material’s roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections, but for rougher materials with less smooth surfaces we may want the reflections to look a bit blurry, to better represent these imperfections.

Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.

The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we execute a ray marching from the fragment position, in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample’s it means that we have moved past foreground geometry and we stop the process. If we got to this point, then we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.
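Sketched in GLSL, the ray march could look like the snippet below. The fixed step size and iteration count are illustrative, and the binary search refinement is omitted:

layout(set = 0, binding = 5) uniform sampler2D scene_depth;   // from the depth-prepass

vec2 ssr_hit_uv(vec3 view_pos, vec3 refl_dir, mat4 proj)
{
   vec3 p = view_pos;
   for (int i = 0; i < 64; ++i) {
      p += refl_dir * 0.1;                          // march along the reflection
      vec4 clip = proj * vec4(p, 1.0);
      vec3 ndc  = clip.xyz / clip.w;
      vec2 uv   = ndc.xy * 0.5 + 0.5;               // screen-space X and Y
      if (texture(scene_depth, uv).r < ndc.z)       // moved past foreground geometry
         return uv;                                 // (a binary search would refine this)
   }
   return vec2(-1.0);                               // no collision found
}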

As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.

The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:

Reflection texture

Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though; this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.

Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won’t go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.

Considering material roughness

Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.
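Sketched in GLSL, the roughness-driven blur is just a box filter whose radius grows with the stored roughness value; the scaling factor here is arbitrary:

layout(set = 0, binding = 6) uniform sampler2D refl_tex;   // output of the capture step

vec3 blur_reflection(vec2 uv, float roughness, vec2 texel_size)
{
   int r = int(roughness * 4.0);          // blur radius grows with roughness
   vec3 sum = vec3(0.0);
   int count = 0;
   for (int x = -r; x <= r; ++x) {
      for (int y = -r; y <= r; ++y) {
         sum += texture(refl_tex, uv + vec2(x, y) * texel_size).rgb;
         count++;
      }
   }
   return sum / float(count);
}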

Alpha blending

Finally, we use alpha blending to incorporate the reflections onto the original image (the output from the tone mapping pass), producing the final rendering:

SSR output

Step 8: Anti-aliasing (FXAA)

So far we have been neglecting anti-aliasing. Because we are doing deferred rendering Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.

In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti-Aliasing (FXAA). The technique looks for strong contrast across neighboring pixels in the image to identify edges and then smooths them out using linear filtering. Here is the final result, which matches the one I included as reference at the top of this post:

Anti-aliased output

The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.

Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.
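To give an idea of the contrast test at the heart of FXAA, here is a much-simplified sketch; the real algorithm then determines the edge direction and blends along it, and the threshold shown is illustrative:

layout(set = 0, binding = 7) uniform sampler2D scene;   // output of the SSR pass

float luma(vec3 c)
{
   return dot(c, vec3(0.299, 0.587, 0.114));   // perceived brightness
}

bool is_edge(vec2 uv)
{
   float c = luma(texture(scene, uv).rgb);
   float n = luma(textureOffset(scene, uv, ivec2( 0,  1)).rgb);
   float s = luma(textureOffset(scene, uv, ivec2( 0, -1)).rgb);
   float e = luma(textureOffset(scene, uv, ivec2( 1,  0)).rgb);
   float w = luma(textureOffset(scene, uv, ivec2(-1,  0)).rgb);
   float range = max(c, max(max(n, s), max(e, w))) - min(c, min(min(n, s), min(e, w)));
   return range > 0.0312;   // below this contrast, no smoothing is applied
}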

Conclusions and source code

So that’s all, congratulations if you managed to read this far! In the past I have found frame analysis articles like this quite interesting, so it’s been fun writing one myself, and I hope it was interesting to someone else too.

This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.

April 16, 2018

On the vc4 front, I did the investigation of the HDL to determine that the OLED matrix applies before the gamma tables, so we can expose it in the DRM for Android’s color correction. Stefan was also interested in reworking his fencing patches to use syncobjs, so hopefully we can merge those and get DRM HWC support in mainline soon.

I also took a look at a warning we’re seeing when a cursor with a nonzero hotspot goes to the upper left corner of the screen – unfortunately, fixing it properly looks like it’ll be a bit of a rework.

I finally took a moment to port over an etnaviv change to remove the need for a DRM subsystem node in the DT. This was a request from Rob Herring long ago, but etnaviv’s change finally made it clear what we should be doing instead.

For vc5, I stabilized the GPU scheduler work and pushed it to my main branch. I’ve now started working on using the GMP to isolate clients from each other (important for being able to have unprivileged GPU workloads running alongside X, and also for making sure that say, some misbehaving webgl doesn’t trash your X server’s other window contents). Hopefully once this security issue is resolved, I can (finally!) propose merging it to the kernel.

April 13, 2018

Dart iMX 8M

The i.MX6 platform has for the past few years enjoyed a large effort to add upstream support to Linux and surrounding projects. Now it is at the point where nothing is really missing any more. Improvements are still being made to the graphics driver for i.MX6, but functionally it is complete.

Etnaviv driver development timeline

The i.MX8 is a different story. The newly introduced platform, with hardware still difficult to get access to, is seeing lots of work being done, but much still remains to be done.

That being said, initial support for the GPU, the Vivante GC7000, is in place and is able to successfully run Wayland/Weston, glmark, etc. This should also mean that running Android on top of the currently not-quite-upstream stack is …

April 09, 2018

I continued spending time on VC5 in the last two weeks.

First, I’ve ported the driver over to the AMDGPU scheduler. Prior to this, vc4 and vc5’s render jobs were queued to the HW in the order that the GL clients submitted them to the kernel. OpenGL requires that jobs within a client effectively happen in that order (though we do some clever rescheduling in userspace to reduce the overhead of some render-to-texture workloads, due to us being a tiler). However, having submission order to the kernel dictate submission order to the HW means that a single busy client (imagine a crypto miner) will starve your desktop workload, since the desktop has to wait behind all of the bulk-work jobs the other client has submitted.

With the AMDGPU scheduler, each client gets its own serial run queue, and the scheduler picks between them as jobs in the run queues become ready. It also gives us easy support for in-fences on your jobs, one of the requirements for Android. All of this is with a bit less vc5 driver code than I had for my own, inferior scheduler.

Currently I’m making it most of the way through piglit and conformance test runs, before something goes wrong around the time of a GPU reset and the kernel crashes. In the process, I’ve improved the documentation on the scheduler’s API, and hopefully this encourages other drivers to pick it up.

Second, I’ve been working on debugging some issues that may be TLB flushing bugs. On the piglit “longprim” test, we go through overflow memory quickly, and allocating overflow memory involves updating PTEs and then having the GPU read from those in very short order. I see a lot of GPU segfaults on non-writable PTEs where the new overflow BO was allocated just after the last one (so maybe the lookups that happened near the end of the last one pre-fetched some PTEs from our space?). The confusing part is that I keep getting write errors far past where I would have expected any previous PTE lookups to have gone. Yet, outside of this case and maybe a couple of others within piglit and the CTS, we seem to be completely fine at PTE updates.

On the VC4 front, I wrote some docs for what I think the steps are for people that want to connect new DSI panels to Raspberry Pi. I reviewed Stefan’s patches for using the CTM for color correction on Android (promising, except I’m concerned it applies at the wrong stage of the DRM display pipeline), and some of Boris’s work on async updates (simplifying our cursor and async pageflip path). I also reviewed an Intel patch that’s necessary for a core DRM change we want for our SAND display support, and a Mesa patch fixing a regression with the new modifiers code.

April 08, 2018

To reduce the number of bugs filed against libinput consider this a PSA: as of GNOME 3.28, the default click method on touchpads is the 'clickfinger' method (see the libinput documentation, it even has pictures). In short, rather than having a separate left/right button area on the bottom edge of the touchpad, right or middle clicks are now triggered by clicking with 2 or 3 fingers on the touchpad. This is the method macOS has been using for a decade or so.

Prior to 3.28, GNOME used the libinput defaults which vary depending on the hardware (e.g. mac touchpads default to clickfinger, most other touchpads usually button areas). So if you notice that the right button area disappeared after the 3.28 update, either start using clickfinger or reset using the gnome-tweak-tool. There are gsettings commands that achieve the same thing if gnome-tweak-tool is not an option:

$ gsettings range org.gnome.desktop.peripherals.touchpad click-method     # list the valid values
$ gsettings get org.gnome.desktop.peripherals.touchpad click-method       # show the current value
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'   # go back to button areas

For reference, the upstream commit is in gsettings-desktop-schemas.

Note that this only affects so-called ClickPads, touchpads where the entire touchpad is a button. Touchpads with separate physical buttons in front of the touchpad are not affected by any of this.

Over on my home blog, I've written a piece about how to make use of the Python bindings for Solaris Analytics, featuring a monitoring daemon I've written to poke my Solar PV inverter on a regular basis. I've got a link to my github repo with the code, too. I also cover the SMF authorizations you need in order to write to the Stats Store, provide an IPS package manifest, SMF manifest and service method, and a Makefile to drive the whole thing.

While this daemon is written to make use of Solaris Analytics, it should also work just fine on other operating systems.

April 04, 2018

In the last update of the free software Panfrost driver, I unveiled the Midgard shader compiler. In the two weeks since then, I’ve shifted my attention from shaders back to the command stream, the fixed-function part of the pipeline. A shader compiler is only useful if there’s a way to run the shaders, after all!

The basic parts of the command stream have been known since the early days of the project, but in the past weeks, I methodically went through the OpenGL ES 2.0 specification searching for new features, writing test code to iterate the permutations, discovering how the feature is encoded in the command stream, and writing a decoder for it. This tedious process is at the heart of any free graphics driver project, but with patience, it is effective.

Thus, since the previous post, I have decoded the fields corresponding to: framebuffer clear flags, fragment discard hinting, viewports, blend shaders, blending colour masks, antialiasing (MSAA), face culling, depth factor/units, the stencil test, the depth test, depth ranges, dithering, texture channel swizzling, texture compare functions, texture wrap modes, alpha coverage, and attribute/varying types.

That was a doozy!

This marks an important milestone: excepting textures, framebuffer objects, and fancy blend modes, the command stream needed for OpenGL ES 2.0 is almost entirely understood. For context on why those features are presently missing, we have not yet been able to replay a sample with textures or framebuffer objects, presumably due to a bug in the replay infrastructure. Until we can do this, no major work can occur for them. Figuring this bit out is high priority, but work on this area is mixed in with work on other parts of the project, to avoid causing a stall (and a lame blog post in two weeks with nothing to report back). As for fancy blend modes, our hardware has a peculiar design involving programmable blending as well as a fixed-function subset of the usual pipeline. Accordingly, I’m deferring work on this obscure feature until the rest of the driver is mature.

On the bright side, we do understand more than enough to begin work on a real driver. Thus, I cordially present the one and only Half-Way Driver! Trademark pending. Name coined by yours truly about five minutes ago.

The premise for this driver is simple: to verify that our understanding of the hardware is sound, we need to write a driver that is higher level than the simple decoded replays. And of course, we want to write a real driver, within Mesa and using Gallium3D infrastructure; after all, the end-goal of the project is to enable graphics applications to use the hardware with free software. It’s pretty hard to drive the hardware without a driver – I should know.

On the other hand, it is preferable to develop this driver independently of Mesa and Gallium3D, to retain control of the flow of the codebase, to speed up development, and to simplify debugging. Mesa and Gallium3D are large codebases; while this is necessary for production use, the sheer number of lines of code contained becomes a cumbersome burden to early driver development. As an added incentive to avoid building within their infrastructure, Mesa recompiles are somewhat slow with hardware like mine: as stated, I use my, ahem, low-power RK3288 laptop for development. Besides, while I’m still discovering new aspects to the hardware in each development session, I could do without the looming, ever-present risk of upstream merge conflicts.

The solution – the creatively named Half-Way Driver – is a driver that is half-way between the opposite development strategies of a replay-driven, independent toy driver versus a mature in-tree Mesa driver. In particular, the idea is to abstract a working replay into command stream constructors that follow Gallium3D conventions, including the permissively licensed Gallium3D headers themselves. This approach combines the benefits of each side: development is fast and easy, build times are short, and once the codebase is mature, it will be simple to move into Mesa itself and gain, almost for free, support for OpenGL, along with a number of other compatible state trackers. As an intermediate easing step, we may hook into this out-of-tree driver from softpipe, the reference software rasteriser in Gallium3D, progressively replacing software functionality with hardware-accelerated routines as possible.

In any event, this new driver is progressing nicely. At the moment, only clearing uses the native Gallium3D interface; the list of Galliumified functions will expand shortly. On the other hand, with a somewhat lower level interface, corresponding closely to the command stream, the driver supports the basic structures needed for rendering 3D geometry and running shaders. After some debugging, taking advantage of the differential tracing infrastructure originally built up to analyse the blob, the driver is able to support multiple draws over multiple frames, allowing for some cute GPU-accelerated animations!

Granted, by virtue of our capture-replay-decode workflow, the driver is not able to render anything that a previous replay could not, greatly limiting my screenshot opportunities. C’est la vie, je suppose. But hey, trust that seeing multiple triangles with different rendering states drawn in the same frame is quite exciting when you’ve been mashing your head against your keyboard for hours comparing command stream traces that are thousands of lines long.

In total, this work-in-progress brings us much closer to having a real Gallium3D driver, at which point the really fun demos start. (I’m looking at you, es2gears!)

On the shader side, progress continues to be steady. In the course of investigating blending on Midgard, including the truly bizarre “blend shaders” required for nontrivial blend modes, I uncovered a number of new opcodes relating to integers. In particular, the disassembler is now aware of the bitwise operations, which are used in this blend shader. For the compiler, I introduced a few new workarounds, presumably due to hardware errata, whose necessity was uncovered by improvements in the command stream.

For Bifrost shaders, Connor has continued his work decoding the instruction set. Notably, his recent changes enable complete disassembly of simple vertex shaders. In particular, he discovered a space-saving trick involving a nuanced mechanism for encoding certain registers, which disambiguated his previous disassembled shaders. Although he realised this fact earlier on, it’s also worth noting that there are great similarities to Midgard vertex shaders which were uncovered a few weeks ago – good news for when a Bifrost compiler is written! Among other smaller changes, he also introduced support for half-floats (fp16) and half-ints (int16), which implies a new set of instruction opcodes. He has also gathered initial traces of the Bifrost command stream, with an intent of gauging the difficulty in porting the current Midgard driver to Bifrost as well, allowing us to test shaders on the elegant new Gxx chips. In total, understanding of Bifrost progresses well; while Midgard is certainly leading the driver effort, the gap is closing.

In the near future, we’ll be Galliumising the driver. Stay tuned for scenes from our next episode!

March 27, 2018

The VCHI patches for Raspberry Pi are now merged to staging-next, which is a big step forward. It should probe by default on linux-next, though we’ve still got a problem with vchiq_test -f, as Stefan Wahren found. Dave has continued working on the v4l2 driver and hopefully we’ll get to merge over it soon.

After my burst of work on VC4, though, it was time to get back to VC5. I’ve been working on GLES conformance again, fixing regressions created by new tests (one of which would wedge the GPU such that it never recovered), and pushing up to about a 98% pass rate. I also got 7278 up and running, and it’s at about 97% now. There is at least one class of GPU hangs to resolve in it before it should match 7268. Some of the pieces from this VC5/6 effort included:

  • Added register spilling support
  • Fixed 2101010 support in a few places
  • Fixed early Z configuration within a frame
  • Fixed disabling of transform feedback on 7278
  • Fixed setup of large transform feedback outputs
  • Fixed transform feedback output with points (common in the CTS)
  • Fixed some asserts in core Mesa that we were the first to hit
  • Fixed gallium blits to integer textures (TGSI is the worst).

March 20, 2018

We've just released Oracle Solaris 11.3 SRU 30. It provides improvements and bug fixes for Oracle Solaris 11 systems. SRU30 is now available from My Oracle Support Doc ID 2045311.1, or via 'pkg update' from the support repository at .

Some of the noteworthy improvements in this SRU include:

  • IOR framework enhancements to support non-MPxIO environment

  • libmikmod has been updated to

  • Apache Ant has been updated to 1.10.1

The SRU also updates the following components which have security fixes:

  • Wireshark has been updated to 2.4.5

  • HMP has been updated to

  • ISC DHCP has been updated to 4.3.6-S1

Full details of this SRU can be found in My Oracle Support Doc 2373752.1
For the list of Service Alerts affecting each Oracle Solaris 11.3 SRU, see Important Oracle Solaris 11.3 SRU Issues (Doc ID 2076753.1).

For some time now I have been working on and off on a personal project with no other purpose than toying a bit with Vulkan and some rendering and shading techniques. Although I’ll probably write about that at some point, in this post I want to focus on Vulkan’s specialization constants and how they can provide a very visible performance boost when they are used properly, as I had the chance to verify while working on this project.

The concept behind specialization constants is very simple: they allow applications to set the value of a shader constant at run-time. At first sight, this might not look like much, but it can have very important implications for certain shaders. To showcase this, let’s take the following snippet from a fragment shader as a case study:

layout(push_constant) uniform pcb {
   int num_samples;
} PCB;

const int MAX_SAMPLES = 64;
layout(set = 0, binding = 0) uniform SamplesUBO {
   vec3 samples[MAX_SAMPLES];
} S;

void main()
{
   for(int i = 0; i < PCB.num_samples; ++i) {
      vec3 sample_i = S.samples[i];
      /* ... process the sample ... */
   }
}

That is a snippet taken from a Screen Space Ambient Occlusion shader that I implemented in my project, a popular technique used in a lot of games, so it represents a real-world scenario. As we can see, the process involves a set of vector samples passed to the shader as a UBO that are processed for each fragment in a loop. We have made the maximum number of samples that the shader can use large enough to accommodate a high-quality scenario, but the actual number of samples used in a particular execution will be taken from a push constant uniform, so the application has the option to choose the quality / performance balance it wants to use.

While the code snippet may look trivial enough, let’s see how it interacts with the shader compiler:

The first obvious issue we find with this implementation is that it prevents loop unrolling, because the actual number of samples to use is unknown at shader compile time. At most, the compiler could guess that it can’t be more than 64, but that number of iterations would still be too large for Mesa to unroll the loop in any case. If the application is configured to only use 24 or 32 samples (the value of our push constant uniform at run-time), then that number of iterations would be small enough that Mesa would unroll the loop if that number was known at shader compile time, so in that scenario we would be losing the optimization just because we are using a push constant uniform instead of a constant for the sake of flexibility.

The second issue, which might be less immediately obvious and yet is the most significant one, is the fact that if the shader compiler can tell that the size of the samples array is small enough, then it can promote the UBO array to a push constant. This means that each access to S.samples[i] turns from an expensive memory fetch into a direct register access for each sample. To put this in perspective, if we are rendering to a full HD target using 24 samples per fragment, it means that we would be saving ourselves from doing 1920x1080x24 memory reads per frame, for a very visible performance gain. But again, we would be losing this optimization because we decided to use a push constant uniform.

Vulkan’s specialization constants allow us to get back these performance optimizations without sacrificing the flexibility we implemented in the shader. To do this, the API provides mechanisms to specify the values of the constants at run-time, but before the shader is compiled.

Continuing with the shader snippet we showed above, here is how it can be rewritten to take advantage of specialization constants:

layout (constant_id = 0) const int NUM_SAMPLES = 64;
layout(std140, set = 0, binding = 0) uniform SamplesUBO {
   vec3 samples[NUM_SAMPLES];
} S;

void main()
{
   for(int i = 0; i < NUM_SAMPLES; ++i) {
      vec3 sample_i = S.samples[i];
      /* ... process the sample ... */
   }
}

We are now informing the shader that we have a specialization constant NUM_SAMPLES, which represents the actual number of samples to use. By default (if the application doesn’t say otherwise), the specialization constant’s value is 64. However, now that we have a specialization constant in place, we can have the application set its value at run-time, like this:

VkSpecializationMapEntry entry = { 0, 0, sizeof(int32_t) };  /* constantID, offset, size */
VkSpecializationInfo spec_info = {
   1, &entry, sizeof(int32_t), &config.ssao.num_samples
};

The application code above sets up the specialization constant information for shader consumption at run-time. This is done via an array of VkSpecializationMapEntry entries, each one determining where to fetch the constant value to use for each specialization constant declared in the shader for which we want to override its default value. In our case, we have a single specialization constant (with id 0), whose value (of integer type) is taken from offset 0 of a buffer; since we only have one specialization constant, that buffer is just the address of the variable holding the constant’s value (config.ssao.num_samples). When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo field of VkPipelineShaderStageCreateInfo. At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.

It is important to remark that specialization takes place when we create the pipeline, since that is the only moment at which Vulkan drivers compile shaders. This makes specialization constants particularly useful when we know the value we want to use ahead of starting the rendering loop, for example when we are applying quality settings to shaders. However, if the value of the constant changes frequently, specialization constants are not useful, since they require expensive shader re-compiles every time we want to change their value, and we want to avoid that as much as possible in our rendering loop. Nevertheless, it is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, as long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the rendering loop and just swap pipelines as needed while rendering.


Specialization constants are a straightforward yet powerful way to gain control over how shader compilers optimize your code. In my particular pet project, applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.

Finally, although the above covered specialization constants from the point of view of Vulkan, this is really a feature of the SPIR-V language, so it is also available in OpenGL with the GL_ARB_gl_spirv extension, which is core since OpenGL 4.6.

March 19, 2018

Contributed by: Thejaswini Kodavur

Have you ever wondered if there was a single service that monitors all your other services and makes administration easier? If yes then “SMF goal services”, a new feature of Oracle Solaris 11.4, is here to provide a single, unambiguous, and well-defined point where one can consider the system up and running. You can choose your customized, mission critical services and link them together into a single SMF service in one step. This SMF service is called a goal service. It can be used to monitor the health of your system upon booting up. This makes administration much easier as monitoring each of the services individually is no longer required!

There are two ways in which you can make your services part of a goal service.

1. Using the supplied Goal Service

By default, an Oracle Solaris 11.4 system provides a goal service called “svc:/milestone/goals:default”. This goal service has, by default, a dependency on the service “svc:/milestone/multi-user-server:default”.

You can set your mission critical service to the default goal service as below:

# svcadm goals system/my-critical-service-1:default

Note: This is a set/clear interface. Therefore the above command will clear the dependency from “svc:/milestone/multi-user-server:default”.

In order to set the dependency on both the services use:

# svcadm goals svc:/milestone/multi-user-server:default \
   system/my-critical-service-1:default

2. Creating your own Goal Service

Oracle Solaris 11.4 allows you to create your own goal service and set your mission critical services as its dependencies. Follow the steps below to create and use a goal service.

  • Create the new service and import its manifest:

# svcbundle -o new-gs.xml -s service-name=milestone/new-gs -s start-method=":true"
# cp new-gs.xml /lib/svc/manifest/site/new-gs.xml
# svccfg validate /lib/svc/manifest/site/new-gs.xml
# svcadm restart svc:/system/manifest-import
# svcs new-gs
STATE          STIME    FMRI
online          6:03:36 svc:/milestone/new-gs:default
  • To make this SMF service as a goal service, set the property general/goal-service=true:
# svcadm disable svc:/milestone/new-gs:default
# svccfg -s svc:/milestone/new-gs:default setprop general/goal-service=true
# svcadm enable svc:/milestone/new-gs:default
  • Now you can set dependencies in the newly created goal services using the -g option as below:
# svcadm goals -g svc:/milestone/new-gs:default system/critical-service-1:default \
   system/critical-service-2:default

Note: If you omit the -g option and do not specify a goal service, the dependency is set on the system-provided default goal service, i.e. svc:/milestone/multi-user-server:default.

  • On system boot, if one of your critical services does not come online, the goal service will go into the maintenance state.
# svcs -d milestone/new-gs
STATE          STIME    FMRI
disabled        5:54:31 svc:/system/critical-service-2:default
online         Feb_19   svc:/system/critical-service-1:default
# svcs milestone/new-gs
STATE          STIME    FMRI
maintenance     5:54:30 svc:/milestone/new-gs:default

Note: You can use -d option of svcs(1) to check the dependencies on your goal service.

  • Once all of the dependent services come online then your goal service will also come online. For goal services to be online, they are expected to have all their dependencies satisfied.
# svcs -d milestone/new-gs
STATE          STIME    FMRI
online         Feb_19   svc:/system/critical-service-1:default
online          5:56:39 svc:/system/critical-service-2:default
# svcs milestone/new-gs
STATE          STIME    FMRI
online          5:56:39 svc:/milestone/new-gs:default

Note: For more information refer to "Goal Services" in smf(7) and subcommand goal in svcadm(8).

The goal service “milestone/new-gs” is your new single SMF service with which you can monitor all of your other mission critical services!

Thus, a goal service acts as the headquarters that monitors the rest of your services.

March 18, 2018

In my last update on the Panfrost project, I showed an assembler and disassembler pair for Midgard, the shader architecture for Mali Txxx GPUs. Unfortunately, Midgard assembly is an arcane, unwieldy language, understood by Connor Abbott, myself, and that’s about it besides engineers bound by nondisclosure agreements. You can read the low-level details of the ISA if you’re interested.

In any case, what any driver really needs is not just an assembler but a compiler. Ideally, such a compiler would live in Mesa itself, capable of converting programs written in high level GLSL into an architecture-specific binary.

Such a mammoth task ought to be delayed until after we begin moving the driver into Mesa, through the Gallium3D infrastructure. In any event, back in January I had already begun such a compiler, ingesting NIR, an intermediate representation coincidentally designed by Connor himself. The past few weeks were spent improving and debugging this compiler until it produced correct, reasonably efficient code for both fragment and vertex shaders.

As of last night, I have reached this milestone for simple shaders!

As an example, an input fragment shader written in GLSL might look like:

uniform vec4 uni4;

void main() {
    gl_FragColor = clamp(
        vec4(1.3, 0.2, 0.8, 1.0) - vec4(uni4.z),
        0.0, 1.0);
}

Through the fully free compiler stack, passed through the free disassembler for legibility, this yields:

vadd.fadd.sat r0, r26, -r23.zzzz
br_cond.write +0
fconstants 1.3, 0.2, 0.8, 1

vmul.fmov r0, r24.xxxx, r0
br_cond.write -1

This is the optimal compilation for this particular shader; the majority of that shader is the standard fragment epilogue which writes the output colour to the framebuffer.

For some background on the assembly, Midgard is a Very Long Instruction Word (VLIW) architecture. That is, multiple instructions are grouped together in blocks. In the disassembly, this is represented by spacing. Each line is an instruction, and blank lines delimit blocks.

The first instruction contains the entirety of the shader logic. Reading it off, it means “using the vector addition unit, perform the saturated floating point addition of the attached constants (register 26) and the negation of the z component of the uniform (register 23), storing the result into register 0”. It’s very compact, but comparing with the original GLSL, it should be clear where this is coming from. The constants are loaded at the end of the block with the fconstants meta instruction.

The other four instructions are the standard fragment epilogue. We’re not entirely sure why it’s so strange – framebuffer writes are fixed from the result of register 0, and are accomplished with a special loop using branching instruction. We’re also not sure why the redundant move is necessary; Connor and I suspect there may be a hardware limitation or errata preventing a br_cond.write instruction from standing alone in a block. Thankfully, we do understand more or less what’s going on, and they appear to be fixed. The compiler is able to generate it just fine, including optimising the code to write into register 0.

As for vertex shaders, well, fragment shaders are simpler than vertex shaders. Whereas the former merely has the aforementioned weird instruction sequence, vertex epilogues need to handle perspective division and viewport scaling, operations which are not implemented in hardware on this embedded GPU. When this is fully implemented, it will be quite a bit more difficult-to-optimise code in the output, although even the vendor compiler does not seem to optimise it. (Perhaps in time our vertex shaders could be faster than the vendor’s compiled shader due to a smarter epilogue!)

Without further ado, an example vertex shader looks like:

attribute vec4 vin;
uniform vec4 u;

void main() {
    gl_Position = (vin + u.xxxx * vec4(0.01, -0.02, 0.0, 0.0)) * (1.0 / u.x);
}

Through the same stack and a stub vertex epilogue which assumes there is no perspective division needed (that the input is normalised device coordinates) and that the framebuffer happens to be the resolution 400x240, the compiler emits:

vmul.fmov r1, r24.xxxx, r26
fconstants 0, 0, 0, 0

ld_attr_32 r2, 0, 0x1E1E

vmul.fmul r4, r23.xxxx, r26
vadd.fadd r5, r2, r4
fconstants 0.01, -0.02, 0, 0

lut.frcp r6.x, r23.xxxx, #2.61731e-39
fconstants 0.01, -0.02, 0, 0

vmul.fmul r7, r5, r6.xxxx

vmul.fmul r9, r7, r26
fconstants 200, 120, 0.5, 0

vadd.fadd r27, r26, r9
fconstants 200, 120, 0.5, 1

st_vary_32 r1, 0, 0x1E9E

There is a lot of room for improvement here, but for now, the important part is that it does work! The transformed vertex (after scaling) must be written to the special register 27. Currently, a dummy varying store is emitted to work around what appears to be yet another hardware quirk. (Are you noticing a trend here? GPUs are funky.) The rest of the code should be more or less intelligible by looking at the ISA notes. In the future, we might improve the disassembler to hide some of the internal encoding peculiarities, such as the dummy r24.xxxx and #0 arguments for fmov and frcp instructions respectively.

All in all, the compiler is progressing nicely. It is currently using a simple SSA-based intermediate representation which maps one-to-one with the hardware, minus details about register allocation and VLIW. This architecture will enable us to optimise our code as needed in the future, once we write a register allocator and an instruction scheduler. A number of arithmetic (ALU) operations are supported, and although there is much work left to do – including generating texture instructions, which were only decoded a few weeks ago – the design is sound, clocking in at a mere 1500 lines of code.

The best part, of course, is that this is no standalone compiler; it is already sitting in our fork of Mesa, using Mesa’s infrastructure. When the driver is written, it’ll be ready from day 1. Woohoo!

Source code is available; get it while it’s hot!

Getting the shader compiler to this point was a bigger time sink than anticipated. Nevertheless, we did do a bit of code cleanup in the meanwhile. On the command stream side, I began passing memory-resident structures by name rather than by address, slowly rolling out a basic watermark allocator. This step is revealing potential issues in the understanding of the command stream, preparing us for proper, non-replay-based driver development. Textures still remain elusive, unfortunately. Aside from that, however, much – if not most – of the command stream is well-understood now. With the help of the shader compiler, basic 3D tests like test-triangle-smoothed are now almost entirely understood and for the most part devoid of magic.

Lyude Paul has been working on code clean-up specifically regarding the build systems. Her goal is to let new contributors play with GPUs, rather than fight with meson and CMake. We’re hoping to attract some more people with low-level programming knowledge and some spare time to pitch in. (Psst! That might mean you! Join us on IRC!)

On a note of administrivia, the project name has been properly changed to Panfrost. For some history, over the summer two driver projects were formed: chai, by me, for Midgard; and BiOpenly, by Lyude et al, for Bifrost. Thanks to Rob Clark’s matchmaking, we found each other and quickly realised that the two GPU architectures had identical command streams; it was only the shader cores that were totally redesigned and led to the rename. Thus, we merged to join efforts, but the new name was never officially decided.

We finally settled on the name “Panfrost”, and our infrastructure is being changed to reflect this. The IRC channel, still on Freenode, now redirects to #panfrost. Additionally, freedesktop.org rolled out their new GitLab CE instance, of which we are the first users; you can find our repositories at the Panfrost organisation on the fd.o GitLab.

On Monday, our project was discussed in Robert Foss’s talk “Progress in the Embedded GPU Ecosystem”. Foss predicted the drivers would not be ready for another three years.

Somehow, I have a feeling it’ll be much sooner!

March 13, 2018

Back in 2014, I posted Moving Oracle Solaris to LP64 bit by bit describing work we were doing then. In 2015, I provided an update covering Oracle Solaris 11.3 progress on LP64 conversion.

Now that we've released the Oracle Solaris 11.4 Beta to the public you can see the ratio of ILP32 to LP64 programs in /usr/bin and /usr/sbin in the full Oracle Solaris package repositories has dramatically shifted in 11.4:

Release        32-bit       64-bit       total
Solaris 11.0   1707 (92%)    144  (8%)    1851
Solaris 11.1   1723 (92%)    150  (8%)    1873
Solaris 11.2   1652 (86%)    271 (14%)    1923
Solaris 11.3   1603 (80%)    379 (19%)    1982
Solaris 11.4    169  (9%)   1769 (91%)    1938

That's over 70% more of the commands shipped in the OS which can use ADI to stop buffer overflows on SPARC, take advantage of more registers on x86, have more address space available for ASLR to choose from, are ready for timestamps and dates past 2038, and receive the other benefits of 64-bit software as described in previous blogs.

And while we continue to provide more features for 64-bit programs, such as making ADI support available in the libc malloc, we aren't abandoning 32-bit programs either. A change that just missed our first beta release, but is coming in a later refresh of our public beta, will make it easier for 32-bit programs to use file descriptors > 255 with stdio calls, relaxing a long-held limitation of the 32-bit Solaris ABI.
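For context, the long-held limitation is that the 32-bit Solaris FILE structure stores its file descriptor in an 8-bit field, so stdio historically could not handle descriptors above 255. A hypothetical C illustration of the failure mode (standard POSIX calls, written for this post rather than taken from Solaris sources):

#include <stdio.h>
#include <fcntl.h>

int main(void)
{
    /* Burn through low descriptors so the next open() returns one > 255. */
    for (int i = 0; i < 300; i++)
        open("/dev/null", O_RDONLY);

    int fd = open("/dev/null", O_RDONLY);  /* now well above 255 */
    FILE *f = fdopen(fd, "r");             /* historically fails on 32-bit Solaris */

    printf("fd = %d, fdopen() %s\n", fd, f ? "succeeded" : "failed");
    return 0;
}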

This work was years in the making, and over 180 engineers contributed to it in the Solaris organization, plus even more who came before to make all the FOSS projects we ship and the libraries we provide be 64-bit ready so we could make this happen. We thank all of them for making it possible to bring this to you now.

March 12, 2018

It was only a few weeks ago that I posted that the Intel Mesa driver had successfully passed the Khronos OpenGL 4.6 conformance tests on day one, and now I am very proud that we can announce the same for the Intel Mesa Vulkan 1.1 driver, the new Vulkan API version announced by the Khronos Group last week. Big thanks to Intel for making Linux a first-class citizen for graphics APIs, and especially to Jason Ekstrand, who did most of the Vulkan 1.1 enablement in the driver.

At Igalia we are very proud of being a part of this: on the driver side, we have contributed the implementation of VK_KHR_16bit_storage, numerous bugfixes for issues raised by the Khronos Conformance Test Suite (CTS) and code reviews for some of the new Vulkan 1.1 features developed by Intel. On the CTS side, we have worked with other Khronos members in reviewing and testing additions to the test suite, identifying and providing fixes for issues in the tests as well as developing new tests.

Finally, I’d like to highlight the strong industry adoption of Vulkan: as stated in the Khronos press release, various other hardware vendors have already implemented conformant Vulkan 1.1 drivers; we are also seeing major 3D engines adopting and supporting Vulkan, and AAA games that have already shipped with Vulkan-powered graphics. There is no doubt that this is only the beginning and that we will be seeing a lot more of Vulkan in the coming years, so look forward to it!

Vulkan and the Vulkan logo are registered trademarks of the Khronos Group Inc.

This week I wrote a little patch series to get VCHI probing on upstream Raspberry Pi. As we’re building a more normal media stack for the platform, I want to get this upstreamed, and VCHI is at the root of the firmware services for media.

Next step for VCHI upstreaming will be to extract Dave Stevenson’s new VCSM driver and upstream it, which as I understand it lets you do media decode stuff without gpu_mem= settings in the firmware – the firmware will now request memory from Linux, instead of needing a fixed carveout. That driver will also be part of the dma-buf plan for the new v4l2 mem2mem driver he’s been working on.

Dave Stevenson has managed to produce a V4L2 mem2mem driver doing video decode/encode. He says it’s still got some bugs, but things look really promising.

In VC4 display, Stefan Schake submitted patches for fixing display plane alpha blending in the DRM hwcomposer for Android, and I’ve merged them to drm-misc-next.

I also rebased my out-of-tree DPI patch, fixed the regression from last year, and submitted patches upstream and downstream (including a downstream overlay). Hopefully this can help other people attach panels to Raspberry Pi.

On the 3D side, I’ve pushed the YUV-import accelerated blit code. We should now be able to display dma-bufs fast in Kodi, whether you’ve got KMS planes or the fallback GL composition.

Also, now that the kernel side has made it to drm-next, I’ve pushed Boris’s patches for vc4 perfmon into Mesa. Now you can use commands like:

apitrace replay application.trace

to examine the behavior of your GL applications on the HW. Note that doing --pdraw level tracing (instead of --pframes) means that each draw call will flush the scene, which is incredibly expensive in terms of memory bandwidth.

March 11, 2018


A recording of the talk is available here.


If you're curious about the slides, you can download the PDF or the OTP.


This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of Embedded Linux Conference NA, for hosting a great event.

March 09, 2018
This is the first entry in an on-going series. Here's a list of all entries:
  1. What has TableGen ever done for us?
  2. Functional Programming
  3. Bits
  4. Resolving variables
  5. DAGs
  6. to be continued
Anybody who has ever done serious backend work in LLVM has probably developed a love-hate relationship with TableGen. At its best it can be an extremely useful tool that saves a lot of manual work. At its worst, it will drive you mad with bizarre crashes, indecipherable error messages, and generally inscrutable failures to understand what you want from it.

TableGen is an internal tool of the LLVM compiler framework. It implements a domain-specific language that is used to describe many different kinds of structures. These descriptions are translated to read-only data tables that are used by LLVM during compilation.

For example, all of LLVM's intrinsics are described in TableGen files. Additionally, each backend describes its target machine's instructions, register file(s), and more in TableGen files.

The unit of description is the record. At its core, a record is a dictionary of key-value pairs. Additionally, records are typed by their superclass(es), and each record can have a name. So for example, the target machine descriptions typically contain one record for each supported instruction. The name of this record is the name of the enum value which is used to refer to the instruction. A specialized backend in the TableGen tool collects all records that subclass the Instruction class and generates instruction information tables that are used by the C++ code in the backend and the shared codegen infrastructure.

The main point of the TableGen DSL is to provide an ostensibly convenient way to generate a large set of records in a structured fashion that exploits regularities in the target machine architecture. To get an idea of the scope, the X86 backend description contains ~47k records generated by ~62k lines of TableGen. The AMDGPU backend description contains ~39k records generated by ~24k lines of TableGen.

To get an idea of what TableGen looks like, consider this simple example:
def Plain {
  int x = 5;
}

class Room<string name> {
  string Name = name;
  string WallColor = "white";
}

def lobby : Room<"Lobby">;

multiclass Floor<int num, string color> {
  let WallColor = color in {
    def _left : Room<num # "_left">;
    def _right : Room<num # "_right">;
  }
}

defm first_floor : Floor<1, "yellow">;
defm second_floor : Floor<2, "gray">;
This example defines 6 records in total. If you have an LLVM build around, just run the above through llvm-tblgen to see them for yourself. The first one has name Plain and contains a single value named x of value 5. The other 5 records have Room as a superclass and contain different values for Name and WallColor.

The first of those is the record of name lobby, whose Name value is "Lobby" (note the difference in capitalization) and whose WallColor is "white".

Then there are four records with the names first_floor_left, first_floor_right, second_floor_left, and second_floor_right. Each of those has Room as a superclass, but not Floor. Floor is a multiclass, and multiclasses are not classes (go figure!). Instead, they are simply collections of record prototypes. In this case, Floor has two record prototypes, _left and _right. They are instantiated by each of the defm directives. Note how even though def and defm look quite similar, they are conceptually different: one instantiates the prototypes in a multiclass (or several multiclasses), the other creates a record that may or may not have one or more superclasses.

The Name value of first_floor_left is "1_left" and its WallColor is "yellow", overriding the default. This demonstrates the late-binding nature of TableGen, which is quite useful for modeling exceptions to an otherwise regular structure:
class Foo {
  string salutation = "Hi";
  string message = salutation # ", world!";
}

def : Foo {
  let salutation = "Hello";
}
The message of the anonymous record defined by the def-statement is "Hello, world!".

There is much more to TableGen. For example, a particularly surprising but extremely useful feature is the bit sets that are used to describe instruction encodings. But that's for another time.

For now, let me leave you with just one of the many ridiculous inconsistencies in TableGen:
class Tag<int num> {
  int Number = num;
}

class Test<int num> {
  int Number1 = Tag<5>.Number;
  int Number2 = Tag<num>.Number;
  Tag Tag1 = Tag<5>;
  Tag Tag2 = Tag<num>;
}

def : Test<5>;
What are the values in the anonymous record? It turns out that Number1 and Number2 are both 5, but Tag1 and Tag2 refer to different records. Tag1 refers to an anonymous record with superclass Tag and Number equal to 5, while Tag2 also refers to an anonymous record, but with Number equal to an unresolved variable reference.

This clearly doesn't make sense at all and is the kind of thing that sometimes makes you want to just throw it all out of the window and build your own DSL with blackjack and Python hooks. The problem with that kind of approach is that even if the new thing looks nicer initially, it'd probably end up in a similarly messy state after another five years.

So when I ran into several problems like the above recently, I decided to take a deep dive into the internals of TableGen with the hope of just fixing a lot of the mess without reinventing the wheel. Over the next weeks, I plan to write a couple of focused entries on what I've learned and changed, starting with how a simple form of functional programming should be possible in TableGen.
This is the fifth part of a series; see the first part for a table of contents.

With bit sequences, we have already seen one unusual feature of TableGen that is geared towards its specific purpose. DAG nodes are another; they look a bit like S-expressions:
def op1;
def op2;
def i32;

def Example {
  dag x = (op1 $foo, (op2 i32:$bar, "Hi"));
}
In the example, there are two DAG nodes, represented by a DagInit object in the code. The first node has as its operation the record op1. The operation of a DAG node must be a record, but there are no other restrictions. This node has two children or arguments: the first argument is named foo but has no value. The second argument has no name, but it does have another DAG node as its value.

This second DAG node has the operation op2 and two arguments. The first argument is named bar and has value i32, the second has no name and value "Hi".

DAG nodes can have any number of arguments, and they can be nested arbitrarily. The values of arguments can have any type, at least as far as the TableGen frontend is concerned. So DAGs are an extremely free-form way of representing data, and they are really only given meaning by TableGen backends.

There are three main uses of DAGs:
  1. Describing the operands on machine instructions.
  2. Describing patterns for instruction selection.
  3. Describing register files with something called "set theory".
I have not yet had the opportunity to explore the last point in detail, so I will only give an overview of the first two uses here.

Describing the operands of machine instructions is fairly straightforward at its core, but the details can become quite elaborate.

I will illustrate some of this with the example of the V_ADD_F32 instruction from the AMDGPU backend. V_ADD_F32 is a standard 32-bit floating point addition, at least in its 32-bit-encoded variant, which the backend represents as V_ADD_F32_e32.

Let's take a look at some of the fully resolved records produced by the TableGen frontend:
def V_ADD_F32_e32 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList = (ins VSrc_f32:$src0, VGPR_32:$src1);
  string AsmOperands = "$vdst, $src0, $src1";
  // ...
}

def anonymous_503 {    // DAGOperand RegisterOperand VOPDstOperand
  RegisterClass RegClass = VGPR_32;
  string PrintMethod = "printVOPDst";
  // ...
}
As you'd expect, there is one out operand. It is named vdst and an anonymous record is used to describe more detailed information such as its register class (a 32-bit general purpose vector register) and the name of a special method for printing the operand in textual assembly output. (The string "printVOPDst" will be used by the backend that generates the bulk of the instruction printer code, and refers to the method AMDGPUInstPrinter::printVOPDst that is implemented manually.)

There are two in operands. src1 is a 32-bit general purpose vector register and requires no special handling, but src0 supports more complex operands as described in the record VSrc_f32 elsewhere.

Also note the string AsmOperands, which is used as a template for the automatically generated instruction printer code. The operand names in that string refer to the names of the operands as defined in the DAG nodes.

This was a nice warmup, but didn't really demonstrate the full power and flexibility of DAG nodes. Let's look at V_ADD_F32_e64, the 64-bit encoded version, which has some additional features: the sign bits of the inputs can be reset or inverted, and the result (output) can be clamped and/or scaled by some fixed constants (0.5, 2, and 4). This will seem familiar to anybody who has worked with the old OpenGL assembly program extensions or with DirectX shader assembly.

The fully resolved records produced by the TableGen frontend are quite a bit more involved:
def V_ADD_F32_e64 {    // Instruction AMDGPUInst ...
  dag OutOperandList = (outs anonymous_503:$vdst);
  dag InOperandList =
    (ins FP32InputMods:$src0_modifiers, VCSrc_f32:$src0,
         FP32InputMods:$src1_modifiers, VCSrc_f32:$src1,
         clampmod:$clamp, omod:$omod);
  string AsmOperands = "$vdst, $src0_modifiers, $src1_modifiers$clamp$omod";
  list<dag> Pattern =
    [(set f32:$vdst, (fadd
      (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers))))];
  // ...
}

def FP32InputMods {     // DAGOperand Operand InputMods FPInputMods
  ValueType Type = i32;
  string PrintMethod = "printOperandAndFPInputMods";
  AsmOperandClass ParserMatchClass = FP32InputModsMatchClass;
  // ...
}

def FP32InputModsMatchClass {   // AsmOperandClass FPInputModsMatchClass
  string Name = "RegOrImmWithFP32InputMods";
  string PredicateMethod = "isRegOrImmWithFP32InputMods";
  string ParserMethod = "parseRegOrImmWithFPInputMods";
  // ...
}
The out operand hasn't changed, but there are now many more special in operands that describe whether those additional features of the instruction are used.

You can again see how records such as FP32InputMods refer to manually implemented methods. Also note that the AsmOperands string no longer refers to src0 or src1. Instead, the printOperandAndFPInputMods method on src0_modifiers and src1_modifiers will print the source operand together with its sign modifiers. Similarly, the special ParserMethod parseRegOrImmWithFPInputMods will be used by the assembly parser.

This kind of extensibility by combining generic automatically generated code with manually implemented methods is used throughout the TableGen backends for code generation.

Something else is new here: the Pattern. This pattern, together with all the other patterns defined elsewhere, is compiled into a giant domain-specific bytecode that executes during instruction selection to turn the SelectionDAG into machine instructions. Let's take this particular pattern apart:
(set f32:$vdst, (fadd ...))
We will match an fadd selection DAG node that outputs a 32-bit floating point value, and this output will be linked to the out operand vdst. (set, fadd, and many others are defined in the target-independent include/llvm/Target/ directory.)
(fadd (f32 (VOP3Mods0 f32:$src0, i32:$src0_modifiers,
                      i1:$clamp, i32:$omod)),
      (f32 (VOP3Mods f32:$src1, i32:$src1_modifiers)))
Both input operands of the fadd node must be 32-bit floating point values, and they will be handled by complex patterns. Here's one of them:
def VOP3Mods { // ComplexPattern
  string SelectFunc = "SelectVOP3Mods";
  int NumOperands = 2;
}
As you'd expect, there's a manually implemented SelectVOP3Mods method. Its signature is
bool SelectVOP3Mods(SDValue In, SDValue &Src,
                    SDValue &SrcMods) const;
It can reject the match by returning false, otherwise it pattern matches a single input SelectionDAG node into nodes that will be placed into src1 and src1_modifiers in the particular pattern we were studying.

Patterns can be arbitrarily complex, and they can be defined outside of instructions as well. For example, here's a pattern for generating the S_BFM_B32 instruction, which generates a bitfield mask:
def anonymous_2373 {    // Pattern Pat ...
  dag PatternToMatch =
    (i32 (shl (i32 (add (i32 (shl 1, i32:$a)), -1)), i32:$b));
  list<dag> ResultInstrs = [(S_BFM_B32 ?:$a, ?:$b)];
}
The name of this record doesn't matter. The instruction selection TableGen backend simply looks for all records that have Pattern as a superclass. In this case, we match an expression of the form ((1 << a) - 1) << b on 32-bit integers into a single machine instruction.
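To make the match concrete: with a = 4 and b = 8, the matched expression evaluates to ((1 << 4) - 1) << 8 = 0xF00 – a mask of four set bits starting at bit 8, which is exactly the bitfield mask a single S_BFM_B32 produces.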

So far, we've mostly looked at how DAGs are interpreted by some of the key backends of TableGen. As it turns out, most backends generate their DAGs in a fairly static way, but there are some fancier techniques that can be used as well. This post is already quite long though, so we'll look at those in the next post.
March 07, 2018
Vulkan 1.1 was officially released today, and thanks to a big effort by Bas and a lot of shared work from the Intel anv developers, radv is a launch day conformant implementation.

Here is a link to the conformance results. This is also radv's first time to be officially conformant on Vega GPUs.
Here is the patch series; it requires a bunch of common anv patches to land first. This stuff should all be landing in Mesa shortly, or will most likely have landed by the time you read this.

In order to advertise 1.1 you need at least a 4.15 Linux kernel.

Thanks to all involved in making this happen, including the behind-the-scenes effort to allow radv to participate in the launch day!

March 06, 2018
This is the fourth part of a series; see the first part for a table of contents.

It's time to look at some of the guts of TableGen itself. TableGen is split into a frontend, which parses the TableGen input, instantiates all the records, resolves variable references, and so on, and many different backends that generate code based on the instantiated records. In this series I'll be mainly focusing on the frontend, which lives in lib/TableGen/ inside the LLVM repository, e.g. here on the GitHub mirror. The backends for LLVM itself live in utils/TableGen/, together with the command line tool's main() function. Clang also has its own backends.

Let's revisit what kind of variable references there are and what kind of resolving needs to be done with an example:
class Foo<int src> {
  int Src = src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
}

multiclass Foos<int src> {
  def a : Foo<src>;
  let Offset = 2 in
  def b : Foo<src>;
}

foreach i = 0-3 in
defm F#i : Foos<i>;
This is actually broken in older LLVM by one of the many bugs, but clearly it should work based on what kind of features are generally available, and with my patch series it certainly does work in the natural way. We see four kinds of variable references:
  • internally within a record, such as the initializer of Dst referencing Src and Offset
  • to a class template variable, such as Src being initialized by src
  • to a multiclass template variable, such as src being passed as a template argument for Foo
  • to a foreach iteration variable
As an aside, keep in mind that let in TableGen does not mean the same thing as in the many functional programming languages that have a similar construct. In those languages let introduces a new variable, but TableGen's let instead overrides the value of a variable that has already been defined elsewhere. In the example above, the let-statement causes the value of Offset to be changed in the record that was instantiated from the Foo class to create the b prototype inside multiclass Foos.

TableGen internally represents variable references as instances of the VarInit class, and the variables themselves are simply referenced by name. This causes some embarrassing issues around template arguments which are papered over by qualifying the variable name with the template name. If you pass the above example through a sufficiently fixed version of llvm-tblgen, one of the outputs will be the description of the Foo class:
class Foo<int Foo:src = ?> {
  int Src = Foo:src;
  int Offset = 1;
  int Dst = !add(Src, Offset);
  string NAME = ?;
}
As you can see, Foo:src is used to refer to the template argument. In fact, the template arguments of both classes and multiclasses are temporarily added as variables to their respective prototype records. When the class or prototype in a multiclass is instantiated, all references to the template argument variables are resolved fully, and the variables are removed (or rather, some of them are removed, and making that consistent is one of the many things I set out to clean up).

Similarly, references to foreach iteration variables are resolved when records are instantiated, although those variables aren't similarly qualified. If you want to learn more about how variable names are looked up, TGParser::ParseIDValue is a good place to start.

The order in which variables are resolved is important. In order to achieve the flexibility of overriding defaults with let-statements, internal references among record variables must be resolved after template arguments.

Actually resolving variable references used to be done by the implementations of the following virtual method of the Init class hierarchy (which represents initializers, i.e. values and expressions):
virtual Init *resolveReferences(Record &R, const RecordVal *RV) const;
This method recursively resolves references in the constituent parts of the expression and then performs constant folding, and returns the resulting value (or the original value if nothing could be resolved). Its interface is somewhat magical: R represents the "current" record which is used as a frame of reference for magical lookups in the implementation of !cast; this is a topic for another time, though. At the same time, variables referencing R are supposed to be resolved, but only if RV is null. If RV is non-null, then only references to that specific variable are supposed to be resolved. Additionally, some behaviors around unset depend on this.

This is replaced in my changes with
virtual Init *resolveReferences(Resolver &R) const;
where Resolver is an abstract base class / interface which can lookup values based on their variable names:
class Resolver {
  Record *CurRec;

public:
  explicit Resolver(Record *CurRec) : CurRec(CurRec) {}
  virtual ~Resolver() {}

  Record *getCurrentRecord() const { return CurRec; }
  virtual Init *resolve(Init *VarName) = 0;
  virtual bool keepUnsetBits() const { return false; }
};
The "current record" is used as a reference for the aforementioned magical !casts, and keepUnsetBits instructs the implementation of bit sequences in BitsInit not to resolve to ? (as was explained in the third part of the series). resolve itself is implemented by one of the subclasses, most notably:
  1. MapResolver: Resolve based on a dictionary of name-value pairs.
  2. RecordResolver: Resolve variable names that appear in the current record.
  3. ShadowResolver: Delegate requests to an underlying resolver, but filter out some names.
This last type of resolver is used by the implementations of !foreach and !foldl to avoid mistakes with nesting. Consider, for example:
class Exclamation<list<string> messages> {
  list<string> Messages = !foreach(s, messages, s # "!");
}

class Greetings<list<string> names>
    : Exclamation<!foreach(s, names, "Hello, " # s)>;

def : Greetings<["Alice", "Bob"]>;
This effectively becomes a nested !foreach. The iteration variable is named s in both, so when substituting s for the outer !foreach, we must ensure that we don't also accidentally substitute s in the inner !foreach. We achieve this by having !foreach wrap the given resolver with a ShadowResolver. The same principle applies to !foldl as well, of course.
March 05, 2018

About two weeks ago, I published a screenshot of a smoothed triangle rendered with a free software driver on a Mali T760 with binary shaders.

But… binary shaders? C’mon, I shouldn’t stoop that low! What good is it to have a free software driver if we’re dependent on a proprietary mystery blob to compile our shaders, arguably the most important capability of a modern graphics driver?

There was little excuse – even then the shader instruction set was partially understood through the work of Connor Abbott back in 2013. At the time, Connor decoded the majority of arithmetic (ALU) and load-store instructions; additionally, he wrote a disassembler based on his findings. It is hard to overstate the magnitude of Connor’s contributions here; decoding a modern instruction set like Midgard is a major feat, of comparable difficulty to decoding the GPU’s command stream itself. In any case, though, his work resulted in detailed documentation and a disassembler strictly for prototyping work, never meant for real world use.

Naturally enough, I did the unthinkable, by linking directly to the disassembler’s internal library from the command stream tracer. After cleaning up the disassembler code a bit, massaging its output into normal assembly rather than a collection of notes-to-self, the relevant source code for our smoothed triangle changed from:

FILE *f_shader_12 = fopen("shader_12.bin", "rb");
fread(shader_12, 1, 4096, f_shader_12);

(where shader_12.bin is a nontrivial blob extracted from the command stream containing the compiled shaders as well as some other unused code), to a much more readable:

const char shader_src_2[] = R"(
    ld_vary_16 r0.xy, 0.xyxx, 0xA01E9E

    vmul.fmov r0, r24.xxxx, hr0
    fb.write 0x1808
    vmul.fmov r0, r24.xxxx, r0
    fb.write 0x1FF8
)";

pandev_shader_assemble(shader_12 + 288, shader_src_2);

There are still some mystery hex constants there, but the big parts are understood for fragment shaders at least. Vertex shaders are a little more complicated, but having this disassembly will make those much easier to understand as well.

In any event, having this disassembly embedded into the command stream isn’t any good without an assembler…

…so, of course, I then wrote a Midgard assembler. It’s about five hundred lines of Python, plus Pythonised versions of architecture definitions from the disassembler. This assembler isn’t too pretty or performant, but as long as it works, it’s okay; the real driver will use an emitter written directly in C and bypassing the assembly phase.

Indeed, this assembler, although still incomplete in some areas, works very well for the simple shaders we’re currently experimenting with. In fact, a compiled binary can be disassembled and then reassembled with our tools, yielding bit identical output.

That is, we can be even more reckless and call out to this prototype assembler from within the command stream. Look Ma, no blobs!

There is no magic. Although Midgard assembly is a bit cumbersome, I have been able to write some simple fragment shaders in assembly by hand, using only the free toolchain. Woohoo!

Sadly, while Connor’s 2013-era notes were impressive, they were lacking in a few notable areas; in particular, he had not made any progress decoding texture words. Similarly, the elusive fbwrite field was never filled in. Not an issue – Connor and I decoded much of the texture pipeline, fbwrite, and branching. Many texture instructions can now be disassembled without unknown words! And of course, for these simpler texture instructions, we can reassemble them again bit-identical.

But we’ve been quite busy. Although the above represents quite a bit of code, that didn’t take the entirety of two weeks, of course. The command stream saw plenty of work, too, but that isn’t quite as glamorous as shaders. I decoded indexed draws, which now appear to work flawlessly. More interestingly, I began work investigating texture and sampler descriptors. A handful of fields are known there, as well as the general structure, although I have not yet successfully replayed any textures, nor have I looked into texture swizzling. Additionally, I identified a number of minor fields relating to: glFrontFace, glLineWidth, attribute and uniform count, framebuffer dimensions, depth/stencil enables, face culling, and vertex counts. Together, I estimate I’ve written about 1k lines of code since the last update, which is pretty crazy.

So, what’s next in the pipeline?

Textures, of course! I’d also like to clean up the command stream replays, particularly relating to memory allocation, to ensure there are no large gaps in our understanding of the hardware.

After that, well, it’ll be time to dust off the NIR compiler I began at the end of January… and start moving code into Mesa!

The future is looking bright for the Panfrost driver.

This week I got the new hardware-accelerated blits for YUV import to GL working again.

The pipeline I have is drm_mmal decoding 360 frames of 1080p Big Buck Bunny trailer using MMAL, importing them to GL as an image_external texture, drawing to the back buffer, and pageflipping.

Switching MMAL from producing linear RGBA buffers to linear NV12 buffers improved FPS by 18.4907% +/- 0.746806% (n=7), and to YV12 by 14.4922% +/- 0.569289%. The improvement is slightly understated, as there’s some fixed overhead of waiting for vcsm to time out to indicate that the stream is complete.

I also polished up Dave’s SAND support patch for KMS, and submitted it. This lets video decode into KMS planes skip a copy of the buffers (I didn’t do performance comparisons of this, though).

Patches are submitted, and the next step will be to implement import of SAND buffers in GL to match the KMS support.

February 28, 2018

We're seeing blogs about the Solaris 11.4 Beta show up through different channels like Twitter and Facebook, which means you might have missed some of these, so we thought it would be good to do a round-up. This also means you might have already seen some of them, but hopefully there are some nice new ones among them.

We hope you enjoy.

After the Raspberry Pi visit, I had a week off to wander around the UK with my partner, and now I’m back.

First, I got to fix regressions in Mesa master on both vc4 and vc5. (Oh, how I wish for non-vendor-specific CI of this project). I also wrote 17 patches to fix various compiler warnings that were driving me nuts.

I refactored my VC4 YUV GL import support, and pulled out the basic copying of the incoming linear data into tiled for the 3D engine to use. This is a CPU-side copy, so it’s really slow due to the uncached read, but it means that you can now import YUV textures using EGL_image_external on Mesa master. Hopefully this can enable Kodi devs to start playing with this on their KMS build.

I’ve also rewritten the hardware-accelerated YUV blit code to hopefully be mergeable. Now I just need to stabilize it.

In VC5 land, I’ve tested and pushed a couple of new fixes to enable 3D textures.

On the kernel side, I’ve merged a bunch of DT and defconfig patches for Pi platform enabling, and sent them upstream. In particular I want to call out Baruch Siach’s firmware GPIO expander patch series, which took several revisions to get accepted (sigh, DT), but will let us do proper Pi3 HDMI hotplug detection and BT power management. Boris’s merged patch to forward-port my I2C fix also apparently fixes some EDID detection on HDMI monitors, which will be good news for people trying to switch to KMS.

February 26, 2018
Edit 2018-02-26: renamed from libevdev-python to python-libevdev. That seems to be a more generic name and easier to package.

Last year, just before the holidays Benjamin Tissoires and I worked on a 'new' project - python-libevdev. This is, unsurprisingly, a Python wrapper to libevdev. It's not exactly new since we took the git tree from 2016 when I was working on it the first time round but this time we whipped it into a better shape. Now it's at the point where I think it has the API it should have, pythonic and very easy to use but still with libevdev as the actual workhorse in the background. It's available via pip3 and should be packaged for your favourite distributions soonish.

Who is this for? Basically anyone who needs to work with the evdev protocol. While C is still a thing, there are many use-cases where Python is a much more sensible choice. The python-libevdev documentation on ReadTheDocs provides a few examples which I'll copy here, just so you get a quick overview. The first example shows how to open a device and then continuously loop through all events, searching for button events:

import libevdev

fd = open('/dev/input/event0', 'rb')
d = libevdev.Device(fd)
if not d.has(libevdev.EV_KEY.BTN_LEFT):
    print('This does not look like a mouse device')

# Loop indefinitely while pulling the currently available events off
# the file descriptor
while True:
    for e in
        if not e.matches(libevdev.EV_KEY):
            continue

        if e.matches(libevdev.EV_KEY.BTN_LEFT):
            print('Left button event')
        elif e.matches(libevdev.EV_KEY.BTN_RIGHT):
            print('Right button event')
The second example shows how to create a virtual uinput device and send events through that device:

import libevdev

d = libevdev.Device() = 'some test device'
d.enable(libevdev.EV_REL.REL_X)
d.enable(libevdev.EV_REL.REL_Y)

uinput = d.create_uinput_device()
print('new uinput test device at {}'.format(uinput.devnode))
events = [libevdev.InputEvent(libevdev.EV_REL.REL_X, 1),
          libevdev.InputEvent(libevdev.EV_REL.REL_Y, 1),
          libevdev.InputEvent(libevdev.EV_SYN.SYN_REPORT, 0)]
uinput.send_events(events)
And finally, if you have a textual or binary representation of events, the evbit function helps to convert it to something useful:

>>> import libevdev
>>> print(libevdev.evbit(0))
EV_SYN:0
>>> print(libevdev.evbit(2))
EV_REL:2
>>> print(libevdev.evbit(3, 4))
ABS_RY:4
>>> print(libevdev.evbit('EV_ABS'))
EV_ABS:3
>>> print(libevdev.evbit('EV_ABS', 'ABS_X'))
ABS_X:0
>>> print(libevdev.evbit('ABS_X'))
ABS_X:0
The latter is particularly helpful if you have a script that needs to analyse event sequences and look for protocol bugs (or hw/fw issues).

More explanations and details are available in the python-libevdev documentation. That doc also answers the question why python-libevdev exists when there's already a python-evdev package. The code is up on github.

February 23, 2018
This is the third part of a series; see the first part for a table of contents.

One of the main backend uses of TableGen is describing target machine instructions, and that includes describing the binary encoding of instructions and their constituent parts. This requires a certain level of bit twiddling, and TableGen supports this with explicit bit (single bit) and bits (fixed-length sequence of bits) types:
class Enc<bits<7> op> {
  bits<10> Encoding;

  let Encoding{9-7} = 5;
  let Encoding{6-0} = op;
}

def InstA : Enc<0x35>;
def InstB : Enc<0x08>;
... will produce records:
def InstA {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 1, 1, 0, 1, 0, 1 };
  string NAME = ?;
}
def InstB {     // Enc
  bits<10> Encoding = { 1, 0, 1, 0, 0, 0, 1, 0, 0, 0 };
  string NAME = ?;
}
So you can quite easily slice and dice bit sequences with curly braces, as long as the indices themselves are constants.

But the real killer feature is that so-called unset initializers, represented by a question mark, aren't fully resolved in bit sequences:
class Enc<bits<3> opcode> {
  bits<8> Encoding;
  bits<3> Operand;

  let Encoding{0} = opcode{2};
  let Encoding{3-1} = Operand;
  let Encoding{5-4} = opcode{1-0};
  let Encoding{7-6} = { 1, 0 };
}

def InstA : Enc<5>;
... produces a record:
def InstA {     // Enc
  bits<8> Encoding = { 1, 0, 0, 1, Operand{2}, Operand{1}, Operand{0}, 1 };
  bits<3> Operand = { ?, ?, ? };
  string NAME = ?;
}
So instead of going ahead and saying, hey, Operand{2} is ?, let's resolve that and plug it into Encoding, TableGen instead keeps the fact that bit 3 of Encoding refers to Operand{2} as part of its data structures.

Together with some additional data, this allows a backend of TableGen to automatically generate code for instruction encoding and decoding (i.e., disassembling). For example, it will create the source for a giant C++ method with signature
uint64_t getBinaryCodeForInstr(const MCInst &MI, /* ... */) const;
which contains a giant constant array with all the fixed bits of each instruction followed by a giant switch statement with cases of the form:
case AMDGPU::S_CMP_EQ_I32:
case AMDGPU::S_CMP_EQ_U32:
case AMDGPU::S_CMP_EQ_U64:
// more cases...
  // op: src0
  op = getMachineOpValue(MI, MI.getOperand(0), Fixups, STI);
  Value |= op & UINT64_C(255);
  // op: src1
  op = getMachineOpValue(MI, MI.getOperand(1), Fixups, STI);
  Value |= (op & UINT64_C(255)) << 8;
The bitmasks and shift values are all derived from the structure of unset bits as in the example above, and some additional data (the operand DAGs) are used to identify the operand index corresponding to TableGen variables like Operand based on their name. For example, here are the relevant parts of the S_CMP_EQ_I32 record generated by the AMDGPU backend's TableGen files:
def S_CMP_EQ_I32 {      // Instruction (+ other superclasses)
  field bits<32> Inst = { 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, src1{7}, src1{6}, src1{5}, src1{4}, src1{3}, src1{2}, src1{1}, src1{0}, src0{7}, src0{6}, src0{5}, src0{4}, src0{3}, src0{2}, src0{1}, src0{0} };
  dag OutOperandList = (outs);
  dag InOperandList = (ins SSrc_b32:$src0, SSrc_b32:$src1);
  bits<8> src0 = { ?, ?, ?, ?, ?, ?, ?, ? };
  bits<8> src1 = { ?, ?, ?, ?, ?, ?, ?, ? };
  // many more variables...
}
Note how Inst, which describes the 32-bit encoding as a whole, refers to the TableGen variables src0 and src1. The operand indices used in the calls to MI.getOperand() above are derived from the InOperandList, which contains nodes with the corresponding names. (SSrc_b32 is the name of a record that subclasses RegisterOperand and describes the acceptable operands, such as registers and inline constants.)

Hopefully this helped you appreciate just how convenient TableGen can be. Not resolving the ? in bit sequences is an odd little exception to an otherwise fairly regular language, but the resulting expressive power is clearly worth it. It's something to keep in mind when we discuss how variable references are resolved.
February 21, 2018

We've just released Oracle Solaris 11.3 SRU 29. It contains some important security fixes and enhancements. SRU29 is now available from My Oracle Support Doc ID 2045311.1, or via 'pkg update' from the support repository.

Features in this SRU include:

  • libdax support on X86

    • This feature enables the use of DAX query operations on x86 platforms. The ISV and Open Source communities can now develop DAX programs on x86 platforms; applications developed on x86 can be executed on SPARC platforms with no modifications, and the libdax API will choose the DAX operations supported by the platform. The libdax library on x86 uses software emulation and does not require any change in user-developed applications.

  • Oracle VM Server for SPARC has been updated to a newer version.

  • The Java 8, Java 7, and Java 6 packages have been updated.

The SRU also updates the following components which have security fixes:

  • p7zip has been updated to 16.02

  • Firefox has been updated to 52.6.0esr

  • ImageMagick has been updated to 6.9.9-30

  • Thunderbird has been updated to 52.6.0

  • libtiff has been updated to 4.0.9

  • Wireshark has been updated to 2.4.4

  • NVIDIA driver has been updated

  • irssi has been updated to 1.0.6

  • BIND has been updated to 9.10.5-P3


Full details of this SRU can be found in My Oracle Support Doc 2361795.1
For the list of Service Alerts affecting each Oracle Solaris 11.3 SRU, see Important Oracle Solaris 11.3 SRU Issues (Doc ID 2076753.1).

This is the second part of a series; see the first part for a table of contents.

When the basic pattern of having classes with variables that are filled in via template arguments or let-statements reaches the limits of its expressiveness, it can become useful to calculate values on the fly. TableGen provides string concatenation out of the box with the paste operator ('#'), and there are built-in functions which can be easily recognized since they start with an exclamation mark, such as !add, !srl, !eq, and !listconcat. There is even an !if-builtin and a somewhat broken and limited !foreach.

There is no way of defining new functions, but there is a pattern that can be used to make up for it: classes with ret-values:
class extractBit<int val, int bitnum> {
  bit ret = !and(!srl(val, bitnum), 1);
}

class Foo<int val> {
  bit bitFour = extractBit<val, 4>.ret;
}

def Foo1 : Foo<5>;
def Foo2 : Foo<17>;
This doesn't actually work in LLVM trunk right now because of the deficiencies around anonymous record instantiations that I mentioned in the first part of the series, but after a lot of refactoring and cleanups, I got it to work reliably. It turns out to be an extremely useful tool.

In case you're wondering, this does not support recursion and it's probably better that way. It's possible that TableGen is already accidentally Turing complete, but giving it that power on purpose seems unnecessary and might lead to abuse.

Without recursion, a number of builtin functions are required. There has been a !foreach for a long time, and it is a very odd duck:
def Defs {
  int num;
}

class Example<list<int> nums> {
  list<int> doubled = !foreach(Defs.num, nums, !add(Defs.num, Defs.num));
}

def MyNums : Example<[4, 1, 9, -3]>;
In many ways it does what you'd expect, except that having to define a dummy record with a dummy variable in this way is clearly odd and fragile. Until very recently it did not actually support everything you'd think even then, and even with the recent fixes there are plenty of bugs. Clearly, this is how !foreach should look instead:
class Example<list<int> nums> {
  list<int> doubled =
      !foreach(x, nums, !add(x, x));
}

def MyNums : Example<[4, 1, 9, -3]>;
... and that's what I've implemented.

This ends up being a breaking change (the only one in the whole series, hopefully), but !foreach isn't actually used in upstream LLVM proper anyway, and external projects can easily adapt.

A new feature that I have found very helpful is a fold-left operation:
class Enumeration<list<string> items> {
  list<string> ret = !foldl([], items, lhs, item,
      !listconcat(lhs, [!size(lhs) # ": " # item]));
}

def MyList : Enumeration<["foo", "bar", "baz"]>;
This produces the following record:
def MyList {    // Enumeration
  list<string> ret = ["0: foo", "1: bar", "2: baz"];
  string NAME = ?;
}
Needless to say, it was necessary to refactor the TableGen tool very deeply to enable this kind of feature, but I am quite happy with how it ended up.

The title of this entry is "Functional Programming", and in a sense I lied. Functions are not first-class values in TableGen even with my changes, so one of the core features of functional programming is missing. But that's okay: most of what you'd expect to have and actually need is now available in a consistent manner, even if it's still clunkier than in a "real" programming language. And again: making functions first-class would immediately make TableGen Turing complete. Do we really want that?

whistles – Nothing to see here, move along kids.

Hello, Triangle!

February 20, 2018


What we're seeing here is the LED being fully off (albeit with floating clock and data inputs), drawing somewhere between 0.7-1 mA.

I was quite surprised to see such a high quiescent current.

For the APA102 2020, which has a 2.0x2.0mm footprint, this is somewhat disappointing – not because it is worse than the normal 5050 APA102 variants, but rather because the small footprint begs for the IC to be used in wearables and other power-consumption-sensitive applications.
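Some back-of-the-envelope arithmetic shows why: at roughly 1 mA of quiescent draw per LED, a hypothetical 100-LED wearable would sink about 100 mA (around 0.5 W at 5 V) while displaying nothing at all – enough to flatten a small 500 mAh battery in about five hours.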


So this is the very simple setup I was using. It's nothing fancy; a multimeter set to the mA range, connected between the power supply and the APA102 breakout board I happened to have laying around.


February 16, 2018
When I designed virgl, I added a capability system to pass some info about the host GL to the guest driver, along the lines of gallium caps. The design is that at the virtio-gpu level you have a number of capsets, each of which has a max version and max size.

The virgl capset is capset 1 with max version 1 and size 308 bytes.
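Concretely, the per-capset metadata amounts to something like this (an illustrative C struct in the spirit of the virtio-gpu protocol headers; the field names here are approximations, not copied from the spec):

#include <stdint.h>

/* Per-capset metadata reported by the host: for each capset id,
 * the maximum version and maximum size the host supports. */
struct capset_info {
    uint32_t capset_id;          /* e.g. 1 for the virgl capset */
    uint32_t capset_max_version; /* highest version the host supports */
    uint32_t capset_max_size;    /* size in bytes of that version */
};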

Until now we've happily been using version 1 at 308 bytes. Recently we decided we wanted to have a v2 at 380 bytes, and the world fell apart.

It turned out there is a bug in the guest kernel driver: it asks the host for a list of capsets and allows guest userspace to retrieve them. The guest userspace has its own copy of the struct.

The flow is:
  1. The guest mesa driver gives the kernel a caps struct to fill out for capset 1.
  2. The kernel driver asks the host over virtio for the latest capset 1 info: max size and version.
  3. The host gives it the max_size and version for capset 1.
  4. The kernel driver asks the host to fill out malloced memory of max_size with the caps struct.
  5. The kernel driver copies the returned caps struct to userspace, using the size of the returned host struct.

The bug is in the last step: it uses the size of the returned host struct, which ends up corrupting the guest in the scenario where the host has a capset 1 v2 (size 380), but the guest is still running old userspace which only understands capset 1 v1 (size 308).

The 380 bytes get memcpy'd over the 308-byte struct and boom.

Now we can fix the kernel to not do this, but we can't upgrade every kernel in an existing VM. So if we allow the virglrenderer process to expose a v2, all older sw will explode unless it is also upgraded, which isn't really something you want in a VM world.

I came up with some virglrenderer workarounds, but due to another bug where qemu doesn't reset virglrenderer when it should, there was no way to make it reliable, and things like kexec old kernel from new kernel would blow up.

I decided in the end to bite the bullet and just make capset 2 be a repaired one. Unfortunately this needs patches in all 4 components before it can be used.

1) virglrenderer needs to expose capset 2 with the new version/size to qemu.
2) qemu needs to allow the virtio-gpu to transfer capset 2 as a virgl capset to the host.
3) The guest kernel needs fixing to make sure we copy the minimum of the host caps and the guest caps into the guest userspace driver, then it needs to provide a way for guest userspace to know the fixed version is in place.
4) The guest userspace needs to check if the guest kernel has the fix, then query capset 2 first, and fall back to querying capset 1.

After talking to a few other devs in virgl land, they pointed out we could probably just never add a new version of capset 2, and grow the struct endlessly.

The guest driver would fill out the struct it wants to use with its copy of default minimum values. It would then call the kernel ioctl to copy over the host caps; the kernel ioctl would copy the minimum size of the host caps and the guest caps.

In this case if the host has a 400 byte capset 2, and the guest still only has 380 byte capset 2, the new fields from the host won't get copied into the guest struct
and it will be fine.

If the guest has the 400 byte capset 2, but the host only has the 380 byte capset 2, the guest would preinit the extra 20 bytes with its default values (0 or whatever) and the kernel would only copy 380 bytes into the start of the 400 bytes and leave the extra bytes alone.
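In other words, the fixed copy path boils down to something like this (a minimal sketch of the idea, not the actual kernel patch):

#include <stddef.h>
#include <string.h>

/* Copy at most min(host capset size, guest buffer size): a newer,
 * larger host capset can no longer overflow an older guest's struct,
 * and a smaller host capset leaves the guest's defaulted tail alone. */
static void copy_caps(void *guest_caps, size_t guest_size,
                      const void *host_caps, size_t host_size)
{
    size_t n = guest_size < host_size ? guest_size : host_size;

    memcpy(guest_caps, host_caps, n);
    /* Bytes beyond n keep the guest driver's preinitialized defaults. */
}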

Now I just have to go write the patches and confirm it all.

Thanks to Stephane at google for creating the patch that showed how broken it was, and to others in the virgl community who noticed how badly it broke old guests! Now to go write the patches...
February 14, 2018
Hi All,

First off, thank you to everyone who has been sending me PSR test results – I've received well over 100 reports!

Quite a few testers have reported various issues when enabling PSR; three often-reported issues are:

  • flickering

  • black screen

  • cursor / input lag

The Intel graphics team has been working on a number of fixes which make PSR work better in various cases. Note we don't expect this to fix it everywhere, but it should get better and work on more devices in the near future.

This is good news, but the bad news is that all the tests people have so very kindly done for me will need to be redone once the new, improved PSR code is ready for testing. I will do a new blogpost (and email people who have sent me test reports) when the new PSR code is ready for people to (re-)test (sorry).


February 13, 2018
zsh: corrupt history file /home/$USER/.zsh_history

Most zsh users will have seen the above line at one time or another. It means that re-using your shell history is no longer possible.

Maybe some of it can be recovered, but more than likely some has been lost. And even if nothing important has been lost, you probably don't want to spend any time dealing with this.

Make zsh maintain a backup

Run this snippet in the terminal of your choice.

cat <<EOT>> ~/.zshrc

# Backup and restore ZSH history
strings ~/.zsh_history | sed ':a;N;$!ba;s/\\\\\n//g' | sort | uniq -u > ~/.zsh_history.backup
cat ~/.zsh_history ~/.zsh_history.backup | sed ':a;N;$!ba;s/\\\\\n//g'| sort | uniq > ~/.zsh_history
EOT


What does this actually do?

The snippet …

February 12, 2018

As a developer, there are always those projects where it is hard to find a way forward. Drop the project for now and find another one, if only to rest your eyes and gain a new insight into the temporarily abandoned project. This is how I embarked on posix_spawn() as an actual system call, which you will find in Oracle Solaris 11.4. The original library implementation of posix_spawn() uses vfork(), but why care about the old address space if you are not going to use it? Or, worse, stop all the other threads in the process and not start them until exec succeeds or exit() is called?

As I had already written kernel modules for nefarious reasons to run executables directly from the kernel, I decided to benchmark the simple "make a process, execute /bin/true" loop against posix_spawn() from the library. Even with two threads, posix_spawn() scaled poorly: additional threads did not allow a large number of additional spawns per second.

Starting a new process

All ways to start a new process need to copy a number of process properties: file descriptors, credentials, priorities, resource controls, etc.

The original way to start a new process is fork(); you need to mark all the pages as copy-on-write (O(n) in the number of pages in the process), so this gets more and more expensive as the process gets larger. In Solaris we also reserve all the needed swap; a large process calling fork() doubles its swap requirement.
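To put numbers on that: with 4 KiB pages, a 1 GiB process has 262,144 page entries to walk and mark copy-on-write on every fork(), and on Solaris the same call also reserves another full gigabyte of swap for a copy that will most likely never be made.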

In BSD, vfork() was introduced; it borrows the address space and was cheap when it was invented. In much larger processes with hundreds of threads, it became more and more of a bottleneck. Dynamic linking also throws a spanner in the works: what you can do between vfork() and the final exec() is extremely limited.

In the standards universe, posix_spawn() was invented; it was aimed mostly at small embedded systems, and only a small number of specific actions can be performed before the new executable is run. As it was part of the standard, Solaris grew its own copy, built on top of vfork(). It has, of course, the same problems as vfork(); but because it is implemented in the library, we can be sure we steer clear of all the other vfork() pitfalls.
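For readers who haven't used the API, here is a minimal example of what calling it looks like – plain standard posix_spawn() usage, not Solaris-internal code:

#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *argv[] = { "/bin/true", NULL };

    /* Spawn /bin/true without duplicating the (possibly large) parent. */
    int err = posix_spawn(&pid, "/bin/true", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return EXIT_FAILURE;
    }

    int status;
    waitpid(pid, &status, 0);
    return EXIT_SUCCESS;
}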

Native spawn(2) call

The native spawn(2) system call introduced in Oracle Solaris 11.4 shares a lot of code with forkx(2) and execve(2). It mostly avoids doing unneeded operations:

  • do not stop all threads
  • do not copy any data about the current executable
  • do not clear all watch points (vfork())
  • do not duplicate the address space (fork())
  • no need to handle shared memory segments
  • do not copy one or more of the threads (fork1/forkall), create a new one instead
  • do not copy all file pointers
  • no need to restart all threads held earlier

The exec() call copies from its own address space, but when spawn(2) needs the arguments, it is already in a new process. So early in the spawn(2) system call we copy the environment vector and the arguments and save them away. The data blob is given to the child, and the parent waits until the child is about to return from the system call in the new process, or until it decides that it can't actually exec and calls exit instead.

A process can spawn(2) in all its threads, and the concurrency is only limited by locks that need to be held briefly when processes are created.

The performance win depends on the application; you won't win anything unless you use posix_spawn(). I was very happy to see that our standard shell uses posix_spawn() to start new processes, as do popen(3C) and system(3C), so the call is well tested. The more threads you have, the bigger the win. Stopping a thread is expensive, especially if it is held up in a system call. The world used to stop, but now it just continues.

Support in truss(1), mdb(1)

When developing a new system call, special attention needs to be given to proc(5) and truss(1) interaction. The spawn(2) system call is no exception – if anything, it is much harder to get right; support is also needed in debuggers or they won't see a new process starting. This includes mdb(1) but also truss(1). They also need to learn that when spawn(2) succeeds, they are stopped in a completely different executable; we may also have crossed a privilege boundary, e.g., when spawning su(8) or ping(8).

I spent the end of January gearing up for LCA, where I gave a talk about what I’ve done in Broadcom graphics since my last LCA talk 3 years earlier. Video is here.

(Unfortunately, I failed to notice the time countdown, so I didn’t make it to my fun VC5 demo, which had to be done in the hallway after)

I then spent the first week of February in Cambridge at the Raspberry Pi office working on vc4. The goal was to come up with a plan for switching to at least the “fkms” mode with no regressions, with a route to full KMS by default.

The first step was just fixing regressions for fkms in 4.14. The amusing one was mouse lag, caused by us accidentally syncing mouse updates to vblank, and an old patch to reduce HID device polling to ~60fps having been accidentally dropped in the 4.14 rebase. I think we should be at parity-or-better compared to 4.9 now.

For full KMS, the biggest thing we need to fix is getting media decode / camera capture feeding into both VC4 GL and VC4 KMS. I wrote some magic shader code to turn linear Y/U/V or Y/UV planes into tiled textures on the GPU, so that they can be sampled from using GL_OES_EGL_image_external. The kmscube demo works, and working with Dave Stevenson I got a demo mostly working of H.264 decode of Big Buck Bunny into a texture in GL on X11.

While I was there, Dave kept hammering away at the dma-buf sharing work he’s been doing. Our demo worked by having a vc4 fd create the dma-bufs, and importing that into vcsm (to talk MMAL to) and into the vc4 fd used by Mesa (mmal needs the buffers to meet its own size restrictions, so VC4 GL can’t do the allocations for it). The extra vc4 fd is a bit silly – we should be able to take vcsm buffers and export them to vc4.

Also, if VCSM could do CMA allocations for us, then we could potentially have VCSM take over the role of allocating heap for the firmware, meaning that you wouldn’t need big permanent gpu_mem= memory carveouts in order for camera and video to work.

Finally, on the last day Dave got a bit distracted and put together VC4 HVS support for the SAND tiling modifier. He showed me a demo of BBB H.264 decode directly to KMS on the console, and sent me the patch. I’ll do a little bit of polish, and send it out once I get back from vacation.

We also talked about plans for future work. I need to:

  • Polish and merge the YUV texturing support.
  • Extend the YUV texturing support to import SAND-layout buffers with no extra copies (I suspect this will be higher performance media decode into GL than the closed driver stack offered).
  • Make a (downstream-only) QPU user job submit interface so that John Cox’s HEVC decoder can cooperate with the VC4 driver to do deinterlace. (I have a long term idea of us shipping the deinterlace code as a “firmware” blob from the Linux kernel’s perspective and using that blessed blob to securely do deinterlace in the upstream kernel.)
  • Make an interface for the firmware to request a QPU user job submission from VC4, so that the firmware’s fancy camera AWB algorithm can work in the presence of the VC4 driver (right now I believe we fall back to a simpler algorithm on the VPU).
  • Investigate reports of slow PutImage-style uploads from SDL/emulators/etc.

Dave plans to:

  • Finish the VCSM rewrite to export dma-bufs and not need gpu_mem= any more.
  • Make a dma-buf enabled V4L2 mem2mem driver for H.264 decode, JPEG decode, etc. using MMAL and VCSM.

Someone needs to:

  • Use the writeback connector in X to implement rotation (which should be cheaper than using GL to do so).
  • Backdoor the dispmanx library in Raspbian to talk KMS instead when the full vc4 KMS driver is loaded (at least on the console. Maybe with some simple support for X11?).

Finally, other little updates:

  • I ported Mesa to V3D 4.2
  • Fixed some GLES3 conformance bugs for VC5
  • Fixed 3D textures for VC5
  • Worked with Boris on debugging HDMI failures in KMS, and reviewed his patches. Finally the flip_done timeouts should be gone!

February 11, 2018
As is usually the case, I'm long overdue for an update, so this covers the last six(ish) months.  The first part might be old news if you follow Phoronix.

Older News

In the last update, I mentioned basic a5xx compute shader support.  Late last year (and landing in the mesa 18.0 branch) I had a chance to revisit compute support for a5xx, and finished:
  • image support
  • shared variable support
  • barriers, which involved some improvements to the ir3 instruction scheduler so barriers could be scheduled in the correct order (ie. for various types of barriers, certain instructions can't be moved before/after the related barrier)
There were also some semi-related SSBO fixes, and additional r/e of instruction encodings, in particular for barriers (new cat7 group of instructions) and image vs SSBO (where different variations of the cat6 instruction encoding are used for images vs SSBOs).

Also I r/e'd and added support for indirect compute, indirect draw, texture-gather, stencil textures, and ARB_framebuffer_no_attachments on a5xx.  Which brings us pretty close to gles31 support.  And over the holiday break I r/e'd and implemented tiled texture support, because moar fps ;-)

Ilia Mirkin also implemented indirect draw, stencil textures, and ARB_framebuffer_no_attachments for a4xx.  Ilia and Wladimir J. van der Laan also landed a handful of a2xx and a20x fixes.  (But there are more a20x fixes hanging out on a branch which we still need to rebase and merge.)  It is definitely nice seeing older hw, for which the blob driver has long since dropped support, getting some attention.

Other News

Not exactly freedreno related, but probably of some interest to freedreno users.. in the 4.14 kernel, my qcom_iommu driver finally landed!  This was the last piece to having the gpu working on a vanilla upstream kernel on the dragonboard 410c.  In addition, the camera driver also landed in 4.14, and venus, the v4l2 mem-to-mem driver for hw video decode/encode landed in 4.13.  (The venus driver also already has support for db820c.)

Fwiw, the v4l2 mem-to-mem driver interface is becoming the de facto standard for hw video decode/encode on SoCs.  GStreamer has had support for a long time now, and more recently ffmpeg (v3.4) and kodi have gained support.

When I first started on freedreno, qcom support for the upstream kernel was pretty dire (ie. I think serial console support might have worked on some ancient SoC).  When I started, the only kernels that I could use to get the gpu running were old downstream msm android kernels (initially 2.6.35, and on later boards 3.4 and 3.10).  The ifc6410 was the first board on which I (eventually) could run an upstream kernel (after starting out with an msm-3.4 kernel), and the db410c was the first board I got where I never even used a downstream android kernel.  Initially db410c was upstream kernel with a pile of patches, although the size of the patchset dropped over time.  With db820c, that pattern is repeating again (ie. the patchset is already small enough that I managed to easily rebase it myself after 4.14).  Linaro and qcom have been working quietly in the background to upstream all the various drivers that something like drm/msm depends on to work (clk, genpd, gpio, i2c, and other lower level platform support).  This is awesome to see, and the linaro/qcom developers behind this progress deserve all the thanks.  Without much fanfare, snapdragon has gone from a hopeless case (from an upstream perspective) to one of the better supported platforms!

Thanks to the upstream kernel support, and the u-boot/UEFI support which I've mentioned before, Fedora 27 supports db410c out of the box (and the situation should be similar with other distros that have a new enough kernel, and gst/ffmpeg/kodi if you care about hw video decode).  Note that the firmware for db410c (and db820c) has been merged in linux-firmware since that blog post.

More Recent News

More recently, I have been working on a batch of (mostly) compiler related enhancements to improve performance with things that have more complex shaders.  In particular:
  • Switch over to NIR's support for lowering phi-web's to registers, instead of dealing with phi instructions in ir3.  NIR has a much more sophisticated pass for coming out of SSA, which does a better job at avoiding the need to insert extra MOV instructions, although a bunch of RA (register allocation) related fixes were required.  The end result is fewer instructions in resulting shader, and more importantly a reduction in register usage.
  • Using NIR's peephole_select pass to lower if/else, instead of our own pass.  This was a pretty small change (although it took some work to arrive at a decent threshold).  Previously the ir3_nir_lower_if_else pass would try to lower all if/else to select instructions, but in extreme cases this is counter-productive as it increases register pressure.  (Background: in simple cases for a GPU, executing both sides of an if/else and using a select instruction to choose the results makes sense, since GPUs tend to be a SIMT arch, and if you aren't executing both sides, you are stalling threads in a warp that took the opposite direction in the if/else.. but in extreme cases this increases register usage which reduces the # of warps in flight.)  End result was 4x speedup in alu2 benchmark, although in the real world it tends to matter less (ie. most shaders aren't that complex).
  • Better handling of sync flags across basic blocks
  • Better instruction scheduling across basic blocks
  • Better instruction scheduling for SFU instructions (ie. sqrt, rsqrt, sin, cos, etc) to avoid stalls on SFU.
  • R/e and add support for the (sat)urate flag (to avoid an extra sequence of min.f + max.f instructions to clamp a result)
  • And a few other tweaks.
The end result tends to depend on how complex the shaders used by a game/benchmark are.  At the extreme high end, 4x improvement for alu2.  On the other hand, it probably doesn't make much difference for older games like xonotic.  Supertuxkart and most of the other gfxbench benchmarks show something along the lines of a 10-20% improvement.  Supertuxkart in particular, with the advanced pipeline, sees a 30% improvement from the combination of the compiler improvements with the previous lrz and tiled texture work (ie. FD_MESA_DEBUG=lrz,ttile)!  Some of the more complex shaders I've been looking at, like shadertoy piano, show a 25% improvement from the compiler changes alone.  (Shadertoy isn't likely to benefit from lrz/ttile since it is basically just drawing a quad with all the rendering logic in the fragment shader.)
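If you want to experiment with that combination yourself, the debug options are plain environment flags; for example (using supertuxkart as a stand-in for your own binary):

$ FD_MESA_DEBUG=lrz,ttile supertuxkart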

In other news, things are starting to get interesting for snapdragon 845 (sdm845).  Initial patches for a6xx GPU support have been posted (although I still need to get my hands on a6xx hw to start r/e for userspace, so those probably won't be merged soon).  And qcom has drm/msm display support buried away in their msm-4.9 tree (expect to see a first round of patches for upstream soon.. it's a lot of code, so expect some refactoring before it is merged, but it's good to get this process started now).

February 10, 2018

We are happy to announce the support of Oracle GoldenGate 12.2.0.x on Oracle Solaris Cluster 4.3. The support includes all the configurations supported with the previous release of Oracle GoldenGate 12.1.2.x.

For detailed information on installing and configuring the high availability data service for Oracle GoldenGate 12.2, refer to the Oracle Solaris Cluster Data Service for Oracle GoldenGate Guide.

February 09, 2018

For the past few years a clear trend of containerization of applications and services has emerged. Having processes containerized is beneficial in a number of ways. It both improves portability and strengthens security, and if done properly the performance penalty can be low.

In order to further improve security, containers are commonly run in virtualized environments. This provides some new challenges in terms of supporting the accelerated graphics use case.

OpenGL ES implementation

Currently Collabora and Google are implementing OpenGL ES 2.0 support. OpenGL ES 2.0 is the lowest common denominator for many mobile platforms and as such is a requirement for Virgil3D to be viable on those platforms.

That is the motivation for making Virgil3D work on OpenGL ES hosts.

How …

February 08, 2018

We've been getting random questions about how to install (Oracle Solaris) packages onto a newly installed Oracle Solaris 11.4 Beta system. And of course key is pointing to the appropriate IPS repository.

One of the options is to download the full repository and install it on its own locally, or add it to an existing local repository, and then just point the publisher to this local repository. This is mostly used by folks who have a test system/LDom/Kernel Zone where they will probably have one or more local repositories already.

However, experience shows that a large percentage of folks testing a beta version like this do so in a VirtualBox instance on their laptop or workstation. And because of this they want to use the Gnome Desktop rather than remotely logging in through ssh. So one of the things we do is supply an Oracle VM Template for VirtualBox which already has the solaris-desktop group package installed (officially group/system/solaris-desktop), so it shows more than the console when started and gives you the ability to run desktop tasks like Firefox and a Terminal. (Btw, as per the Release Notes on Runtime Issues, there's a glitch with gnome-terminal you might run into, and you'd need to run a workaround to get it working.)

For this group of VirtualBox-based testers the chances are high that they're not going to have a local repository nearby, especially on a laptop that's moving around. This is where using our central repository is very useful, and it is well described in the Oracle Solaris documentation.

However, going through this there may be some minor obstacles to clear that aren't directly part of the process but get in the way when using the VirtualBox-installed OVM Template.

First, when using the Firefox browser to request and download certificates and later point to the repository, you'll need to have DNS working, and depending on the install the DNS client may not yet be enabled. Here's how you check it:

demo@solaris-vbox:~$ svcs dns/client
STATE          STIME    FMRI
disabled        5:45:26 svc:/network/dns/client:default

This is fairly simple to solve. First check that the Oracle Solaris instance has correctly picked up the DNS information from VirtualBox in the DHCP process by looking in /etc/resolv.conf. If that looks good, simply enable the dns/client service:

demo@solaris-vbox:~$ sudo svcadm enable dns/client

You'll be asked for your password and then it will be enabled. Note you can also use pfexec(1) instead of sudo(8); this will also check whether your user has the appropriate privileges.

You can check if the service is running:

demo@solaris-vbox:~/Downloads$ svcs dns/client
STATE          STIME    FMRI
online         10:21:16 svc:/network/dns/client:default

Now that DNS is running, you should be able to resolve and reach external hosts.

The second gotcha is that on the certificate request page the Oracle Solaris 11.4 Beta repository is at the very bottom of the list of available repositories, and it should not be confused with the Oracle Solaris 11 Support repository (to which you may already have requested access) listed at the top of the page.

The same certificate/key pair is used for any of the Oracle Solaris repositories; however, in order to permit the use of any existing cert/key pair, the license for the Oracle Solaris 11.4 Beta repository must be accepted. This means selecting the 'Request Access' button next to the Solaris 11.4 Beta repository entry.

Once you have the cert/key, or you have accepted the license, then you can configure the beta repository as:

pkg set-publisher -k <your-key> -c <your-cert> -g solaris

With the Virtual Box image the default repository setup includes the 'release' repository. It is best to remove that:

pkg set-publisher -G solaris

This can be performed in one command:

pkg set-publisher -k <your-key> -c <your-cert> -G '*' -g solaris

Note that here too you'll need to use either pfexec(1) or sudo(8) again. This should kick off the pkg(1) command, and once it's done you can check its status with:

demo@solaris-vbox:~/Downloads$ pkg publisher solaris
            Publisher: solaris
                Alias:
           Origin URI:
        Origin Status: Online
              SSL Key: /var/pkg/ssl/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
             SSL Cert: /var/pkg/ssl/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 Cert. Effective Date: January 29, 2018 at 03:04:58 PM
Cert. Expiration Date: February 6, 2020 at 03:04:58 PM
          Client UUID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      Catalog Updated: January 24, 2018 at 02:09:16 PM
              Enabled: Yes

And now you're up and running.

A final thought: if, for example, you've chosen to install the Text Install version of the Oracle Solaris 11.4 Beta because you want a nice minimal install without the overhead of Gnome and the like, you can also download the key and certificate to another system or the hosting OS (in case you're using VirtualBox) and then rsync or rcp them across, and then follow all the same steps.
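For example, something like this from the hosting OS (the placeholders stand for whatever file names the certificate page gave you, and demo@solaris-vbox for your beta instance):

$ scp <your-key> <your-cert> demo@solaris-vbox:Downloads/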

February 05, 2018

The number one use case for live migration today is for evacuation: when a Solaris Zones host needs some maintenance operation that involves a reboot, then the zones are live migrated to some other willing host. This avoids scheduling simultaneous maintenance windows for all the services provided by those zones.

Implementing this today on Solaris 11.3 involves manually migrating zones with individual zoneadm migrate commands, and especially, determining suitable destinations for each of the zones. To make this common scenario simpler and less error prone, Solaris 11.4 Beta comes with a new command sysadm(8) for system maintenance that also allows for zone evacuation.

The basic idea of how it is supposed to be used is like this:

# pkg update
...
# sysadm maintain -s -m "updating to new build"
# sysadm evacuate -v
Evacuating 3 zones...
Migrating myzone1 to rads://destination1/ ...
Migrating myzone3 to rads://destination1/ ...
Migrating myzone4 to rads://destination2/ ...
Done in 3m30s.
# reboot
...
# sysadm maintain -e
# sysadm evacuate -r
...

When in maintenance mode, an attempt to attach or boot any zone is refused: if the admin is trying to move zones off the host, it's not helpful to allow incoming zones. Note that this maintenance mode is recorded system-wide, not just in the zones framework; even though the only current impact is on zones, it seems likely other sub-systems may find it useful in the future.

To set up an evacuation target for a zone, the SMF property evacuation/target of the given zone's service instance system/zones/zone:<zone-name> must be set to the target host, using either a rads:// or an ssh:// location identifier. Do not forget to refresh the service instance for your change to take effect.
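For example, for a hypothetical zone myzone1 (a sketch only; see sysadm(8) for the exact property syntax):

# svccfg -s system/zones/zone:myzone1 setprop evacuation/target = astring: "ssh://root@bjork"
# svccfg -s system/zones/zone:myzone1 refresh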

You can evacuate running Kernel Zones as well as installed native and Kernel Zones. An evacuation always covers the running zones; with the option -a, installed zones are included as well. Only those zones with the evacuation/target property set in their service instance are scheduled for evacuation. However, if any running zone (or any installed zone when evacuate -a is used) does not have the property set, the overall result of the evacuation is reported as failed by sysadm, which is logical, as an evacuation by definition means evacuating everything.

As live zone migration does not support native zones, those can only be evacuated in the installed state. Also note that you can only evacuate zones installed on shared storage, for example on iSCSI volumes. See the storage URI manual page, suri(7), for information on what other shared storage is supported. Note that you can install Kernel Zones to NFS files as well.

To set up live Kernel Zone migration, please check out the Migrating an Oracle Solaris Kernel Zone section of the 11.4 online documentation.

Now, let's see a real example. We have a few zones on host nacaozumbi. All running and installed zones are on shared storage, including the native zone tzone1 and Kernel Zone evac1:

root:nacaozumbi:~# zonecfg -z tzone1 info rootzpool
rootzpool:
    storage: iscsi://saison/luname.naa.600144f0dbf8af1900005582f1c90007
root:nacaozumbi:~# zonecfg -z evac1 info device
device:
    storage: iscsi://saison/luname.naa.600144f0dbf8af19000058ff48060017
    id: 1
    bootpri: 0
root:nacaozumbi:~# zoneadm list -cv
  ID NAME       STATUS      PATH                    BRAND       IP
   0 global     running     /                       solaris     shared
  82 evac3      running     -                       solaris-kz  excl
  83 evac1      running     -                       solaris-kz  excl
  84 evac2      running     -                       solaris-kz  excl
   - tzone1     installed   /system/zones/tzone1    solaris     excl
   - on-fixes   configured  -                       solaris-kz  excl
   - evac4      installed   -                       solaris-kz  excl
   - zts        configured  -                       solaris-kz  excl

Zones not set for evacuation were detached, ie. on-fixes and zts. All running and installed zones are set to be evacuated to bjork, for example:

root:nacaozumbi:~# svccfg -s system/zones/zone:evac1 listprop evacuation/target
evacuation/target astring ssh://root@bjork

Now, let's start the maintenance window:

root:nacaozumbi:~# sysadm maintain -s -m "updating to new build"
root:nacaozumbi:~# sysadm maintain -l
TYPE   USER  DATE              MESSAGE
admin  root  2018-02-02 01:10  updating to new build

At this point we can no longer boot or attach zones on nacaozumbi:

root:nacaozumbi:~# zoneadm -z on-fixes attach
zoneadm: zone 'on-fixes': attach prevented due to system maintenance: see sysadm(8)

And that also includes migrating zones to nacaozumbi:

root:bjork:~# zoneadm -z on-fixes migrate ssh://root@nacaozumbi
zoneadm: zone 'on-fixes': Using existing zone configuration on destination.
zoneadm: zone 'on-fixes': Attaching zone.
zoneadm: zone 'on-fixes': attach failed:
zoneadm: zone 'on-fixes': attach prevented due to system maintenance: see sysadm(8)

Now we start evacuating all the zones. In this example, all running and installed zones have their service instance property evacuation/target set. The option -a means all the zones, that is including those installed. The -v option provides verbose output.

root:nacaozumbi:~# sysadm evacuate -va
sysadm: preparing 5 zone(s) for evacuation ...
sysadm: initializing migration of evac1 to bjork ...
sysadm: initializing migration of evac3 to bjork ...
sysadm: initializing migration of evac4 to bjork ...
sysadm: initializing migration of tzone1 to bjork ...
sysadm: initializing migration of evac2 to bjork ...
sysadm: evacuating 5 zone(s) ...
sysadm: migrating tzone1 to bjork ...
sysadm: migrating evac2 to bjork ...
sysadm: migrating evac4 to bjork ...
sysadm: migrating evac1 to bjork ...
sysadm: migrating evac3 to bjork ...
sysadm: evacuation completed successfully.
sysadm: evac1: evacuated to ssh://root@bjork
sysadm: evac2: evacuated to ssh://root@bjork
sysadm: evac3: evacuated to ssh://root@bjork
sysadm: evac4: evacuated to ssh://root@bjork
sysadm: tzone1: evacuated to ssh://root@bjork

While being evacuated, you can check the state of evacuation like this:

root:nacaozumbi:~# sysadm evacuate -l
sysadm: evacuation in progress

After the evacuation is done, you can also see the details like this (for example, in case you did not run it in verbose mode):

root:nacaozumbi:~# sysadm evacuate -l -o ZONENAME,STATE,DEST
ZONENAME  STATE      DEST
evac1     EVACUATED  ssh://root@bjork
evac2     EVACUATED  ssh://root@bjork
evac3     EVACUATED  ssh://root@bjork
evac4     EVACUATED  ssh://root@bjork
tzone1    EVACUATED  ssh://root@bjork

And you can see all the evacuated zones are now in the configured state on the source host:

root:nacaozumbi:~# zoneadm list -cv
  ID NAME       STATUS      PATH                    BRAND       IP
   0 global     running     /                       solaris     shared
   - tzone1     configured  /system/zones/tzone1    solaris     excl
   - evac1      configured  -                       solaris-kz  excl
   - on-fixes   configured  -                       solaris-kz  excl
   - evac4      configured  -                       solaris-kz  excl
   - zts        configured  -                       solaris-kz  excl
   - evac3      configured  -                       solaris-kz  excl
   - evac2      configured  -                       solaris-kz  excl

And the migrated zones are happily running or in the installed state on host bjork:

jpechane:bjork:~$ zoneadm list -cv
  ID NAME       STATUS      PATH                    BRAND       IP
   0 global     running     /                       solaris     shared
  57 evac3      running     -                       solaris-kz  excl
  58 evac1      running     -                       solaris-kz  excl
  59 evac2      running     -                       solaris-kz  excl
   - on-fixes   installed   -                       solaris-kz  excl
   - tzone1     installed   /system/zones/tzone1    solaris     excl
   - zts        installed   -                       solaris-kz  excl
   - evac4      installed   -                       solaris-kz  excl

The maintenance state is still held at this point:

root:nacaozumbi:~# sysadm maintain -l
TYPE   USER  DATE              MESSAGE
admin  root  2018-02-02 01:10  updating to new build

Upgrade the system to a new boot environment, unless you did that before (which you should have, to keep the time your zones are running on the other host to a minimum):

root:nacaozumbi:~# pkg update --be-name=.... -C0 entire@...
root:nacaozumbi:~# reboot

Now, finish the maintenance mode.

root:nacaozumbi:~# sysadm maintain -e

And as the final step, return all the evacuated zones. As explained before, you would not be able to do this while still in maintenance mode.

root:nacaozumbi:~# sysadm evacuate -ra
sysadm: preparing zones for return ... 5/5
sysadm: returning zones ... 5/5
sysadm: return completed successfully.

Possible future enhancements we are considering include specifying multiple targets and a spread policy, with a resource utilisation comparison algorithm that would consider CPU architecture, RAM and CPU resources.

Recently Software AG renewed their product availability pages for Oracle Solaris 11 and Oracle Solaris 10 (SPARC & x86-64).
webMethods version 10 is included. The most recent update to this release has significant enhancements in the following areas:

            • Integration

            • API Management

            • webMethods Dynamic Apps

            • Suite Enhancement

More information on webMethods 10 system requirements is also available via Software AG’s Empower portal (login required).


“Many of Software AG’s customers rely on the enterprise-class capabilities and reliability of Oracle Solaris. We are pleased to confirm our commitment to Oracle Solaris as part of our ongoing certification.”
– Jonathan Heywood, VP Product Management and Communities, Software AG.

This is part two in my series of posts about Solaris Analytics in the Solaris 11.4 release. You may find part one here.

The Solaris Analytics WebUI (or "bui" for short) is what we use to tie together all our data gathering from the Stats Store. It comprises two web apps (titled "Solaris Dashboard" and "Solaris Analytics"). Enable the webui service via:

# svcadm enable webui/server
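You can confirm it came up with the standard SMF tooling (nothing WebUI-specific here):

# svcs webui/server

If it does not come online, "svcs -x webui/server" should tell you why.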

Once the service is online, point your browser at your system's WebUI address and log in. [Note that the self-signed certificate is the one generated by your system, and adding an exception for it in your browser is fine.] Rather than roll our own toolkit, we make use of Oracle Jet, which means we can keep a consistent look and feel across Oracle web applications.

After logging in, you'll see yourself at the Oracle Solaris Web Dashboard, which shows an overview of several aspects of your system, along with Faults (FMA) and Solaris Audit activity if your user has sufficient privileges to read them.


Mousing over any of the visualizations on this page will give you a brief description of what the visualization provides, and clicking on it will take you to a more detailed page.

If you click on the hostname in the top bar (next to Applications), you'll see what we call the Host Drawer. This pulls information from svc:/system/sysstat.

Click the 'x' on the top right to close the drawer.

Selecting Applications / Solaris Analytics will take you to the main part of the bui:

I've selected the NFS Client sheet, resulting in the dark shaded box on the right popping up with a description of what the sheet will show you.

Building blocks: faults, utilization and audit events
In the previous installment I mentioned that we wanted to provide a way for you to tie together the many sources of information we provide, so that you could answer questions about your system. This is a small example of how you can do so.

The host these screenshots were taken from is a single-processor, four-core Intel-based workstation. In a terminal window I ran

# psradm -f 3

followed a few minutes later by

# psradm -n 3
You can see those events marked on each of the visualizations with a blue triangle here:

Now if I mouseover the triangle marking the second offline/online pair, in the Thread Migrations viz, I can see that the system generated a Solaris Audit event:

This allows us to observe that the changes in system behaviour (primarily load average and thread migrations across cores) were correlated with the offlining of a cpu core.

Finally, let's have a look at the Audit sheet. To view the stats on this page, you need to log in to the bui as a suitably-privileged user: either root, or a user granted the required authorizations, e.g. via usermod(8):


# usermod -A $USER  

For this screenshot I not only redid the psradm operations from earlier, I also tried making an ssh connection with an unknown user, and logged in on another of this system's virtual consoles. There are many other things you could observe with the audit subsystem; this is just a glimpse:

Tune in next time for a discussion of using the C and Python bindings to the Stats Store so you can add your own statistics.

February 03, 2018

A recording of the talk can be found here.


If you're curious about the slides, you can download the PDF or the OTP.


This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers and volunteers of FOSDEM, for hosting a great community event.

Composite acceleration in the X server

One of the persistent problems with the modern X desktop is the number of moving parts required to display application content. Consider a simple PresentPixmap call as made by the Vulkan WSI or GL using DRI3:

  1. Application calls PresentPixmap with new contents for its window

  2. X server receives that call and pends any operation until the target frame

  3. At the target frame, the X server copies the new contents into the window pixmap and delivers a Damage event to the compositor

  4. The compositor responds to the damage event by copying the window pixmap contents into the next screen pixmap

  5. The compositor calls PresentPixmap with the new screen contents

  6. The X server receives that call and either posts a Swap call to the kernel or delays any action until the target frame

This sequence has a number of issues:

  • The operation is serialized between three processes with at least three context switches involved.

  • There is no traceable relation between when the application asked for the frame to be shown and when it is finally presented. Nor do we even have any way to tell the application what time that was.

  • There are at least two copies of the application contents, from DRI3 buffer to window pixmap and from window pixmap to screen pixmap.

We'd also like to be able to take advantage of the multi-plane capabilities in the display engine (where available) to directly display the application contents.

Previous Attempts

I've tried to come up with solutions to this issue a couple of times in the past.

Composite Redirection

My first attempt to solve (some of) this problem was through composite redirection. The idea there was to directly pass the Present'd pixmap to the compositor and let it copy the contents directly from there in constructing the new screen pixmap image. With some additional hand waving, the idea was that we could associate that final presentation with all of the associated redirected compositing operations and at least provide applications with accurate information about when their images were presented.

This fell apart when I tried to figure out how to plumb the necessary events through to the compositor and back. With that, and the realization that we still weren't solving problems inherent with the three-process dance, nor providing any path to using overlays, this solution just didn't seem worth pursuing further.

Automatic Compositing

More recently, Eric Anholt and I have been discussing how to have the X server do all of the compositing work by natively supporting ARGB window content. By changing compositors to place all screen content in windows, the X server could then generate the screen image by itself and not require any external compositing manager assistance for each frame.

Given that a primitive form of automatic compositing is already supported, extending that to support ARGB windows and having the X server manage the stack seemed pretty tractable. We would extend the driver interface so that drivers could perform the compositing themselves using a mixture of GPU operations and overlays.

This runs up against five hard problems though.

  1. Making transitions between Manual and Automatic compositing seamless. We've seen how well the current compositing environment works when flipping compositing on and off to allow full-screen applications to use page flipping. Lots of screen flashing and application repaints.

  2. Dealing with RGB windows with ARGB decorations. Right now, the window frame can be an ARGB window with the client being RGB; painting the client into the frame yields an ARGB result with the A values being 1 everywhere the client window is present.

  3. Mesa currently allocates buffers exactly the size of the target drawable and assumes that the upper left corner of the buffer is the upper left corner of the drawable. If we want to place window manager decorations in the same buffer as the client and not need to copy the client contents, we would need to allocate a buffer large enough for both client and decorations, and then offset the client within that larger buffer.

  4. Synchronizing window configuration and content updates with the screen presentation. One of the major features of a compositing manager is that it can construct complete and consistent frames for display; partial updates to application windows need never be shown to the user, nor does the user ever need to see the window tree partially reconfigured. To make this work with automatic compositing, we'd need to both codify frame markers within the 2D rendering stream and provide some method for collecting window configuration operations together.

  5. Existing compositing managers don't do this today. Compositing managers are currently free to paint whatever they like into the screen image; requiring that they place all screen content into windows would mean they'd have to buy in to the new mechanism completely. That could still work with older X servers, but the additional overhead of more windows containing decoration content would slow performance with those systems, making migration less attractive.

I can think of plausible ways to solve the first three of these without requiring application changes, but the last two require significant systemic changes to compositing managers. Ick.

Semi-Automatic Compositing

I was up visiting Pierre-Loup at Valve recently and we sat down for a few hours to consider how to help applications regularly present content at known times, and to always know precisely when content was actually presented. That names just one of the above issues, but when you consider the additional work required by pure manual compositing, solving that one issue is likely best achieved by solving all three.

I presented the Automatic Compositing plan and we discussed the range of issues. Pierre-Loup focused on the last problem -- getting existing Compositing Managers to adopt whatever solution we came up with. Without any easy migration path for them, it seemed like a lot to ask.

He suggested that we come up with a mechanism which would allow Compositing Managers to ease into the new architecture and slowly improve things for applications. Towards that, we focused on a much simpler problem:

How can we get a single application at the top of the window stack to reliably display frames at the desired time, and to know when that doesn't occur?

Coming up with a solution for this led to a good discussion and a possible path to a broader solution in the future.

Steady-state Behavior

Let's start by ignoring how we start and stop this new mode and look at how we want applications to work when things are stable:

  1. Windows not moving around
  2. Other applications idle

Let's get a picture I can use to describe this:

In this picture, the compositing manager is triple buffered (as is normal for a page flipping application) with three buffers:

  1. Scanout. The image currently on the screen

  2. Queued. The image queued to be displayed next

  3. Render. The image being constructed from various window pixmaps and other elements.

The contents of the Scanout and Queued buffers are identical with the exception of the orange window.

The application is double buffered:

  1. Current. What it has displayed for the last frame

  2. Next. What it is constructing for the next frame

Ok, so in the steady state, here's what we want to happen:

  1. Application calls PresentPixmap with 'Next' for its window

  2. X server receives that call and copies Next to Queued.

  3. X server posts a Page Flip to the kernel with the Queued buffer

  4. Once the flip happens, the X server swaps the names of the Scanout and Queued buffers.

If the X server supports Overlays, then the sequence can look like:

  1. Application calls PresentPixmap

  2. X server receives that call and posts a Page Flip for the overlay

  3. When the page flip completes, the X server notifies the client that the previous Current buffer is now idle.

When the Compositing Manager has content to update outside of the orange window, it will:

  1. Compositing Manager calls PresentPixmap

  2. X server receives that call and paints the Current client image into the Render buffer

  3. X server swaps Render and Queued buffers

  4. X server posts Page Flip for the Queued buffer

  5. When the page flip occurs, the server can mark the Scanout buffer as idle and notify the Compositing Manager

If the Orange window is in an overlay, then the X server can skip step 2.

The Auto List

To give the Compositing Manager control over the presentation of all windows, each call to PresentPixmap by the Compositing Manager will be associated with the list of windows, the "Auto List", for which the X server will be responsible for providing suitable content. Transitioning from manual to automatic compositing can therefore be performed on a window-by-window basis, and each frame provided by the Compositing Manager will separately control how that happens.

The Steady State behavior above would be represented by having the same set of windows in the Auto List for the Scanout and Queued buffers, and when the Compositing Manager presents the Render buffer, it would also provide the same Auto List for that.

Importantly, the Auto List need not contain only children of the screen Root window. Any descendant window at all can be included, and the contents of that drawn into the image using appropriate clipping. This allows the Compositing Manager to draw the window manager frame while the client window is drawn by the X server.

Any window at all can be in the Auto List. Windows with PresentPixmap contents available would be drawn from those. Other windows would be drawn from their window pixmaps.

Transitioning from Manual to Auto

To transition a window from Manual mode to Auto mode, the Compositing Manager would add it to the Auto List for the Render image, and associate that Auto List with the PresentPixmap request for that image. For the first frame, the X server may not have received a PresentPixmap for the client window, and so the window contents would have to come from the Window Pixmap for the client.

I'm not sure how we'd get the Compositing Manager to provide another matching image that the X server can use for subsequent client frames; perhaps it would just create one itself?

Transitioning from Auto to Manual

To transition a window from Auto mode to Manual mode, the Compositing manager would remove it from the Auto List for the Render image and then paint the window contents into the render image itself. To do that, the X server would have to paint any PresentPixmap data from the client into the window pixmap; that would be done when the Compositing Manager called GetWindowPixmap.

New Messages Required

For this to work, we need some way for the Compositing Manager to discover windows that are suitable for Auto compositing. Normally, these will be windows managed by the Window Manager, but it's possible for them to be nested further within the application hierarchy, depending on how the application is constructed.

I think what we want is to tag Damage events with the source window, and perhaps additional information to help Compositing Managers determine whether it should be automatically presenting those source windows or a parent of them. Perhaps it would be helpful to also know whether the Damage event was actually caused by a PresentPixmap for the whole window?

To notify the server about the Auto List, a new request will be needed in the Present extension to set the value for a subsequent PresentPixmap request.
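Purely as a strawman, and emphatically not a worked-out protocol proposal (the request name and fields here are invented), such a request might look like:

PresentSetAutoList
    window: WINDOW
    list:   LISTofWINDOW

    Associates 'list' with the next PresentPixmap request on 'window'.
    The X server becomes responsible for providing the content of each
    window in 'list' when constructing that frame.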

Actually Drawing Frames

The DRM module in the Linux kernel doesn't provide any mechanism to remove or replace a Page Flip request. While this may get fixed at some point, we need to deal with how it works today, if only to provide reasonable support for existing kernels.

I think about the best we can do is to set a timer to fire a suitable time before vblank and have the X server wake up and execute any necessary drawing and Page Flip kernel calls. We can use feedback from the kernel to know how much slack time there was between any drawing and the vblank and adjust the timer as needed.
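To make the idea concrete, here is a rough sketch of that wakeup logic (illustrative only, not actual server code; the vblank timestamp and period would come from the kernel's page-flip completion events):

#include <stdint.h>
#include <time.h>

/* Sleep until 'slack_ns' before the predicted vblank, then draw and
 * queue the page flip (e.g. with drmModePageFlip()). */
static void
wait_for_draw_window(uint64_t last_vblank_ns, uint64_t period_ns,
                     uint64_t slack_ns)
{
    uint64_t wake_ns = last_vblank_ns + period_ns - slack_ns;
    struct timespec wake = {
        .tv_sec = (time_t)(wake_ns / 1000000000ull),
        .tv_nsec = (long)(wake_ns % 1000000000ull),
    };

    /* Absolute sleep on the monotonic clock; feedback from the kernel
     * about how much slack was actually left would be used to adjust
     * slack_ns over time. */
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &wake, NULL);
}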

Given that the goal is to provide for reliable display of the client window, it might actually be sufficient to let the client PresentPixmap request drive the display; if the Compositing Manager provides new content for a frame where the client does not, we can schedule that for display using a timer before vblank. When the Compositing Manager provides new content after the client, it would be delayed until the next frame.

Changes in Compositing Managers

As described above, one explicit goal is to ease the burden on Compositing Managers by making them able to opt-in to this new mechanism for a limited set of windows and only for a limited set of frames. Any time they need to take control over the screen presentation, a new frame can be constructed with an empty Auto List.

Implementation Plans

This post is the first step in developing these ideas to the point where a prototype can be built. The next step will be to take feedback and adapt the design to suit. Of course, there's always the possibility that this design will also prove unworkable in practice, but I'm hoping that this third attempt will actually succeed.

February 02, 2018

Long time no see; quite a lot has happened in the meantime. So let’s begin with that.

The past

My last post was from 25th August 2017. It was about my GSoC project and how I was preparing the final patch set, that would then be posted to the xorg-devel mailing list.

That’s quite some time ago, and I also didn’t follow up on what exactly happened with the patch set.

Regarding the long pause in communication, it was because of my Master’s thesis in mathematics. I finished it in December and the title is “Vertex-edge graphs of hypersimplices: combinatorics and realizations”.

While the thesis was a lot of work, I’m very happy with the result. I found a relatively intuitive approach to hypersimplices describing them as geometric objects and in the context of graph theory. I even wrote a small application that calculates certain properties of arbitrary hypersimplices and depicts their spectral representations up to the fourth dimension with Qt3D.

I’m currently waiting for my grade, but besides that my somewhat long student career suddenly came to an end.

Regarding my patch set: It did not get merged directly, but I got some valuable feedback from experienced Xserver devs back then. Of course I didn’t want to give up on them, but I had to first work on my thesis and I planned to rework the patches once the thesis was handed in.

At this time I also watched some of the videos from XDC2017 and was happily surprised that my mentor, Daniel Stone, said that he wants my GSoC work in the next Xserver release. His trust in my work really motivated me. I also had some contact with Intel devs, who said that they look forward to my project being merged.

So after I handed in my thesis, I first worked on some other stuff and also needed some time off after the exhausting end phase of the thesis, but in the last two weeks I reworked my patches and posted a new patch set to the mailing list. I hope this patch set can be accepted for the upcoming Xserver 1.20 release.

The future

I already knew for a prolonged time that after my master’s degree in mathematics I wanted to leave university and not pursue a scientific career. The first reason was that after 10 years of study, most of the time with very abstract topics, I just wanted to interact with some real world problems again. And in retrospect I was always most motivated in my studies when I could connect abstract theory with practical problems in social science or engineering.

Since computers were a passion of mine already at a young age, the currently most interesting technological achievements happen in the digital field, and it is somewhat close to the work of a mathematician, I decided to go in this direction.

I had participated in some programming courses through my studies - and in one semester break created a Pong clone in Java for mobile phones being operated by phone movement; it was fun but will forever remain in the depths of one of my old hard disks somewhere - but I had to learn much more if I wanted to work on interesting projects.

In order to build up my own experience, pretty much exactly two years ago I picked a well-known open-source project, which I found interesting for several reasons, to work on. Of course at first I took baby steps, but later on I could accelerate.

So while writing the last paragraph it became apparent to me that all of this was still describing the past. But to know where you’re heading, you need to know where you’re coming from, bla, bla. Anyways, finally looking forward: I now have the great opportunity to work full-time on KDE technology thanks to Blue Systems.

To me this foremost means helping Martin with the remaining tasks for making Plasma Wayland the new default. I will also work on some ARM devices, which in particular means being more exposed to kernel development. That sounds interesting!

Finally, with my GSoC project I already have some experience working on an upstream project. So another goal for me is to foster the relationship of the Plasma project with upstream graphics development by contributing code and feedback. In comparison to GNOME we were a bit underrepresented in this regard, most of all through financial constraints of course.

Another topic, more long-term, that I’m personally interested in is KWin as a VR/AR platform. I imagine possible applications kind of like what Google tried with their Glass project, just as a full desktop experience with multiple application windows floating in front of you. Basically like in every other science fiction movie to date. But yeah, first our Wayland session, then the future.


Writing these lines I’m sitting in a train to Brussels. So if you want to meet up and talk about anything, you will presumably often find me during the next two days at the KDE booth or on Saturday in the graphics devroom. But this is my first time at FOSDEM, so maybe I’ll just stand somewhere in between, unable to orientate myself anymore. In this case please lead me back to the KDE booth. Thanks in advance; I look forward to meeting you and other interesting people in the next two days at FOSDEM.

Oracle Solaris 11.4 Beta (#solaris114beta) was released earlier this week, here is the announcement blog in case you missed it.

There are lots of updates in this release, including many improvements that simplify development, and to our ELF and linker support.  Check out these excellent posts from our very own Ali Bahrami to learn more!

Hi All,

Update: Thank you everyone for all the test reports I've received. The response has been quite overwhelming, with over 50 test reports received so far. The results are all over the place: some people see no changes, some people report the approx. 0.5W saving my own tests show, and many people also report display problems, sometimes combined with a significant increase in power consumption. I need to take a closer look at all the results, but right now I believe that the best way forward with this is (unfortunately) a whitelist matching on a combination of panel-id (from edid) and dmi data, so that we can at least enable this on popular models (any model with at least one user willing to contribute).

As you've probably read already, I'm working on improving Linux laptop battery life. Previously I've talked about enabling SATA link power management by default. This has been enabled in rawhide / Fedora 28 since January 1st, and so far no issues have been reported. This is really good news, as it leads to significantly better idle power consumption (1 - 1.5W lower) on laptops with sata disks. Fedora 28 will also enable HDA codec autosuspend and autosuspend for USB Bluetooth controllers, for another (approx.) 0.8W gain.

But we're not done yet, testing on a Lenovo T440s and X240 has shown that enabling Panel Self Refresh (PSR) by setting i915.enable_psr=1 saves another 0.5W. Enabling this on all machines has been tried in the past and it causes problems on some machines. So we will likely need either a blacklist or whitelist for this. I'm leaning towards a whitelist to avoid regressions, but if there are say only 10-20 models which don't work with it a blacklist makes more sense. So the most important thing to do right now is gather more data, hence this blog post.

So I would like to ask everyone who runs Linux on their laptop (with a recent-ish kernel) to test this and gather some data for me:

  1. Check if your laptop uses an eDP panel: do "ls /sys/class/drm"; there should be a card?-eDP-1 entry. If not, your laptop is using LVDS or DSI for the panel, and this does not apply to your laptop.

  2. Check that your machine supports PSR, do: "cat /sys/kernel/debug/dri/0/i915_edp_psr_status", if this says: "PSR not supported", then this does not apply to your laptop.

  3. Get a baseline power-consumption measurement: install powertop ("sudo dnf install powertop" on Fedora), then close all your apps except for one terminal, maximize that terminal, and run "sudo powertop". Unplug your laptop if plugged in and wait 5 minutes; on some laptops the power measurement is a moving average, so this is necessary to get a reliable reading. Now look at the power consumption shown (e.g. 7.95W) and watch it for a couple of refreshes, as it sometimes spikes when something wakes up to do some work; write down the lowest value you see, this is the base value for your laptop's power consumption.                        Note: beware of "dim screen when idle" messing with your brightness; either make sure you do not touch the laptop for a couple of minutes before taking the reading, or turn this feature off in your power settings.

  4. Add "i915.enable_psr=1" to your kernel cmdline and reboot, check that the LCD panel still works, try suspend/resume and blanking the screen (by e.g. locking it under GNOME3) still work.

  5. Check that PSR actually is enabled now (your panel might not support it): do "cat /sys/kernel/debug/dri/0/i915_edp_psr_status" and check that it says both "Enabled: yes" and "Active: yes"

  6. Measure idle power consumption again as described in step 3. Make sure you use the same LCD brightness setting as before, and write down the new value

  7. Dump your LCD panels edid, run "cat /sys/class/drm/card0-eDP-1/edid > panel-edid"

  8. Send me a mail with the following in there:

  • Report of success or bad side effects

  • The idle powerconsumption before and after the changes

  • The brand and model of your laptop

  • The "panel-edid" file attached

  • The output of the following commands:

  • cat /proc/cpuinfo | grep "model name" | uniq

  • cat /sys/class/dmi/id/modalias
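To make the reporting easier, here is a minimal shell sketch that collects most of the requested data (the grubby step is Fedora-specific, and the psr-report directory and file names are my own invention):

#!/bin/sh
# One-time setup (Fedora): add the PSR option to the kernel command
# line, then reboot before taking the "after" measurements:
#   sudo grubby --update-kernel=ALL --args="i915.enable_psr=1"

mkdir -p psr-report
# PSR status as reported by the i915 driver (needs root).
sudo cat /sys/kernel/debug/dri/0/i915_edp_psr_status > psr-report/psr-status
# The panel edid, for the panel-id based matching discussed above.
cat /sys/class/drm/card0-eDP-1/edid > psr-report/panel-edid
# CPU model and dmi data identifying the laptop.
grep "model name" /proc/cpuinfo | uniq > psr-report/cpu-model
cat /sys/class/dmi/id/modalias > psr-report/dmi-modalias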

Once I have info from enough models, hopefully I can come up with some way for us to enable PSR by default, or at least build a whitelist with popular laptop models and enable it there.

Thanks & Regards,


In Solaris 11.3 we provided the ability to use the Silicon Secured Memory feature of the Oracle SPARC processors in the M7 and M8 families. An API for applications to explicitly manage ADI (Application Data Integrity) versioning was provided, see adi(2) man page, as well as new memory allocator library - libadimalloc(3LIB).

This required either code changes to the application or arranging to set LD_PRELOAD_64=/usr/lib/64/ in the environment variables before the application started. The libadimalloc(3LIB) allocator was derived from the libumem(3LIB) codebase but doesn't expose all of the features that libumem does.

With Oracle Solaris 11.4 Beta the use of ADI has been integrated into the default system memory allocator in libc(3LIB) and libumem(3LIB), while retaining libadimalloc(3LIB) for backwards compatibility with Oracle Solaris 11.3 systems.

Control of which processes run with ADI protection is now via the Security Extensions Framework, using sxadm(8), so it is no longer necessary to set the $LD_PRELOAD_64 environment variable.

There are two distinct ADI based protections exposed via the Security Extensions Framework: ADISTACK and ADIHEAP. These complement the existing extensions introduced in earlier Oracle Solaris 11 update releases: ASLR, NXHEAP and NXSTACK (all three of which are available on SPARC and x86 CPU systems).

ADIHEAP is how the ADI protection is exposed via the standard libc memory allocator and via libumem. The ADISTACK extension, as the name suggests, is for protecting the register save area of the stack.

$ sxadm status
EXTENSION         STATUS                    CONFIGURATION
aslr              enabled (tagged-files)    default (default)
nxstack           enabled (all)             default (default)
nxheap            enabled (tagged-files)    default (default)
adiheap           enabled (tagged-files)    default (default)
adistack          enabled (tagged-files)    default (default)

The above output from sxadm shows the default configuration of an Oracle SPARC M7/M8 based system. What we can see here is that some of the security extensions, including adiheap/adistack, are enabled by default only for tagged files. Executable binaries can be tagged using ld(1) as documented in sxadm(8); for example, if we want to tag an application at build time to use adiheap, we would add '-z sx=adiheap'. Note it is not meaningful at this time to tag shared libraries, only leaf executable programs.
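For example, a build-time tagging sketch ('myprog' is a placeholder; the -z option is passed through to ld(1)):

$ cc -o myprog myprog.c -z sx=adiheap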

Most executables in Oracle Solaris were already tagged to run with the aslr, nxstack and nxheap security extensions. Now many of them are also tagged for ADISTACK and ADIHEAP as well. For the Oracle Solaris 11.4 release we have also had to explicitly tag some executables to not run with ADIHEAP and/or ADISTACK; this is either due to outstanding issues when running with an ADI allocator or, in some cases, to more fundamental issues with how the program itself works (the ImageMagick graphics image processing tool is one such example, where ADISTACK is explicitly disabled).

The sxadm command can be used to start processes with security extensions enabled regardless of the system wide status and binary tagging. For example, to start a program that was not tagged at build time with both ADI based protections, in addition to its binary tagged extensions:

$ sxadm exec -s adistack=enable -s adiheap=enable /path/to/program

It is possible to edit binary executables to add the security extension tags, even if there were none present at link time. Explicit tagging of binaries already installed on a system and delivered by any package management software is not recommended.

If all of the untagged applications that are deployed to be run on a system have been tested to work with the ADI protections, then it is possible to change the system wide defaults rather than having to use sxadm to run the processes:

# sxadm enable adistack,adiheap

The Oracle Solaris 11.4 Beta also has support for the use of ADI to protect kernel memory; this is currently undocumented but is planned to be exposed via sxadm by the 11.4 release or soon after. The KADI support also includes a significant amount of ADI support in mdb, for both live and post-mortem kernel debugging. KADI is enabled by default with precise traps when running a debug build of the kernel. The debug builds are published in the public Oracle Solaris 11.4 Beta repository and can be enabled by running:

# pkg change-variant debug.osnet=true

The use of ADI via the standard libc and libumem memory allocators and by the kernel (in LDOMs and Zones, including with live migration/suspend) has enabled the Oracle Solaris engineering team to find and fix many otherwise difficult to find or diagnose bugs. However, we are not yet at a point where we believe all applications from all vendors are sufficiently well behaved that the ADISTACK and ADIHEAP protections can be enabled by default.

I’ve done a talk about the kernel community. It’s a hot take, but with the feedback I’ve received thus far I think it was on the spot, and started a lot of uncomfortable, but necessary discussion. I don’t think it’s time yet to give up on this project, even if it will take years.

Without further ado, the recording of my talk “Burning Down the Castle” is on YouTube. For those who prefer reading, LWN has you covered with “Too many lords, not enough stewards”. I think Jake Edge and Jon Corbet have done an excellent job in capturing my talk in a balanced fashion. I have also uploaded my slides.

Further Discussion

For understanding abuse dynamics I can’t recommend “Why Does He Do That?: Inside the Minds of Angry and Controlling Men” by Lundy Bancroft enough. All the examples are derived from a few decades of working with abusers in personal relationships, but the patterns and archetypes that Lundy Bancroft extracts transfers extremely well to any other kind of relationship, whether that’s work, family or open source communities.

There’s endless amounts of stellar talks about building better communities. I’d like to highlight just two: “Life is better with Rust’s community automation” by Emily Dunham and “Have It Your Way: Maximizing Drive-Thru Contribution” by VM Brasseur. For learning more there’s lots of great community topic tracks at various conferences, but also dedicated ones - often as unconferences: Community Leadership Summit, including its various offsprings and maintainerati are two I’ve been at and learned a lot.

Finally there’s the fun of trying to change a huge existing organization with lots of inertia. “Leading Change” by John Kotter has some good insights and frameworks to approach this challenge.

Despite what it might look like I’m not quitting kernel hacking nor the community, and I’m happy to discuss my talk over mail and in upcoming hallway tracks.

February 01, 2018

Frequently it is desirable to compare two ELF files. As someone who makes changes to the link-editor, I find comparing large numbers of built objects a vital part of verifying any change. In addition, determining which objects have changed from one build to another can reduce object distribution to only those objects that have changed. Often, it is simply enlightening to know "what did I change in this ELF file to make it different?".

Various tools exist to compare ELF files, often scripts that call upon tools like elfdump(1), dis(1), and od(1) to analyze sections in more detail. These tools can be rather slow and produce voluminous output.

ELF files have an inherent problem when trying to analyze differences — even a small change to a section within an object, i.e. code changes to .text, .data or .rodata, can result in offset changes that ripple through the ELF file, affecting many sections and the data these sections contain. Trying to isolate the underlying cause of a difference between two ELF files, amongst all the differences that exist, can be overwhelming.

elfdiff(1) attempts to analyze two ELF files and diagnose the most important changes. Typically, the most significant changes to an object can be gleaned from changes to the symbol table. Functions and data items get added or deleted, or change size. Most of the time this can be sufficient to know, or confirm, what has changed.

After providing any symbol information, elfdiff continues to grovel down into individual sections and indicate what might have changed. The output style of the diagnostics is a mix of dis(1) for function diffs, od(1) for data diffs, and elfdump(1) style for sections that elfdump understands and provides high-level formatted displays for.

The output is limited. A handful of symbols are displayed first. Sections report a single line of difference, or a single line of difference for each symbol already diagnosed. The styles of each difference, and the order in which they are displayed, are covered in the elfdiff(1) man page.

This is an overview diff, appropriate for answering questions such as "What are the high level differences between two nightly builds". It does not replace lower level tools such as elfdump(1), but rather, provides a higher level analysis that might then be used to guide the use of lower level tools.

Some files may contain sections that always change from one build to another, things like comment or signature sections. These can be ignored with the -i option. Sometimes only one or two sections are of interest; these can be specified with the -s option. If you really want to see all the differences between two files, use the -u option. But be careful, the output can be excessive.
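For instance, a sketch of comparing two builds of a library while skipping the ever-changing sections (file and section names here are hypothetical; see the elfdiff(1) man page for the exact option syntax):

$ elfdiff -i .comment -i .SUNW_signature old/libfoo.so.1 new/libfoo.so.1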

The following provides an example of comparing two versions of a shared object, and is lifted directly from the elfdiff(1) man page.

$ elfdiff -e
*** symbols: differ
< [9287] 0x935c0 0x1bd FUNC GLOB D 0 .text device_offline
> [9287] 0x935c0 0x1f5 FUNC GLOB D 0 .text device_offline
---
< [10233] 0x111240 0x20 FUNC GLOB D 0 .text new_device_A
< [10010] 0x111260 0x64 FUNC GLOB D 0 .text new_device_B
---
> [15317] 0 0 NOTY GLOB D 0 UNDEF __assfailline__13
*** section: [1].SUNW_cap: shdr information differs
<     sh_size: 0xe0    sh_type: [ SHT_SUNW_cap ]
>     sh_size: 0x120   sh_type: [ SHT_SUNW_cap ]
*** section: [1].SUNW_cap: data information differs
<     0x80: [8] CA_SUNW_ID 0x2317 i86pc-clmul
>     0x80: [8] CA_SUNW_ID 0x1f59 i86pc-avx2
*** section: [6].text: shdr information differs
<     sh_size: 0x38e205  sh_type: [ SHT_PROGBITS ]
>     sh_size: 0x38e245  sh_type: [ SHT_PROGBITS ]
*** section: [6].text: data information differs
---
<sym>: device_offline()
<     0x935d9:<sym>+0x19: 48 8b df  movq %rdi,%rbx
>     0x935d9:<sym>+0x19: 4c 8b e7  movq %rdi,%r12
---
< <sym>: new_device_A
<     0x111240:<sym>: 55  push %rbp
---
< <sym>: new_device_B
<     0x111260:<sym>: 55  push %rbp
*** section: [9].strtab: shdr information differs
<     sh_size: 0x642c5  sh_type: [ SHT_STRTAB ]
>     sh_size: 0x642d9  sh_type: [ SHT_STRTAB ]
*** section: [9].strtab: data information differs
<     0x42297: n e _ _ 1 3 8 5 \0 _ _ ...
>     0x42297: n e _ _ 1 3 8 4 \0 _ _ ...
*** section: [13].rela.text: shdr information differs
<     sh_size: 0x36d398  sh_type: [ SHT_RELA ]
>     sh_size: 0x36d428  sh_type: [ SHT_RELA ]
*** section: [13].rela.text: data information differs
<     0x0: [0] R_AMD64_32S 0x1635f4 0x1638b4 .text
>     0x0: [0] R_AMD64_32S 0x163634 0x1638f4 .text
*** section: [33].SUNW_ctf: shdr information differs
<     sh_size: 0xb33c  sh_type: [ SHT_PROGBITS ]
>     sh_size: 0xb4e4  sh_type: [ SHT_PROGBITS ]
*** section: [33].SUNW_ctf: data information differs
<     0xd: \0 \0 \0 \08 \0 \0 \0 \b2 \03 \0 \0 D ...
>     0xd: \0 \0 \0 \08 \0 \0 \0 \ba \03 \0 \0 $ ...
*** section: [34].SUNW_signature: data information differs
<     0x73: j o h n d o e \t \93 \ab \ff \fa ...
>     0x73: j o h n d o e \c2 \c5 \98 r a ...

After the release of the Oracle Solaris 11.4 Beta and the post on the new observability features by James McPherson, I've had a few folks ask me if it's possible to export data from the StatsStore into a format like CSV (comma-separated values) so they can easily import it into something like Excel.

The answer is: Yes

The main command to access the StatsStore through the CLI is sstore(1), which you can either use as a single command or as an interactive shell-like environment, for example to browse the statistics namespace. The other way to access the StatsStore is through the Oracle Solaris Dashboard in a browser, pointed at the system's IP address on port 6787. A third way to access the data is through the REST interface (which the Dashboard itself also uses to get its data), but that is a topic for a later post.
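For example, a quick way to see which CPU statistics are available (I'm assuming here that wildcards work on statistic names the way they do on resource names later in this post; the exact list will depend on your system):

demo@solaris-vbox:~$ sstore list '//:class.cpu//:stat.*'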

As James pointed out in his post, you can use sstore(1) to list the currently available resources, and you can use the export subcommand to pull data from one or more of those resources. It's with export that you can specify the format the data should be exported in. The default is tab-separated:

demo@solaris-vbox:~$ sstore export -t 2018-02-01T06:47:00 -e 2018-02-01T06:52:00 -i 60 '//:class.cpu//:stat.usage'
TIME                VALUE           IDENTIFIER
2018-02-01T06:47:00 20286401.157722 //:class.cpu//:stat.usage
2018-02-01T06:48:00 20345863.706499 //:class.cpu//:stat.usage
2018-02-01T06:49:00 20405363.144286 //:class.cpu//:stat.usage
2018-02-01T06:50:00 20465694.085729 //:class.cpu//:stat.usage
2018-02-01T06:51:00 20525877.600447 //:class.cpu//:stat.usage
2018-02-01T06:52:00 20585941.862812 //:class.cpu//:stat.usage

But you can also get it in CSV:

demo@solaris-vbox:~$ sstore export -F csv -t 2018-02-01T06:47:00 -e 2018-02-01T06:52:00 -i 60 '//:class.cpu//:stat.usage'
time,//:class.cpu//:stat.usage
1517496420000000,20286401.157722
1517496480000000,20345863.706499
1517496540000000,20405363.144286
1517496600000000,20465694.085729
1517496660000000,20525877.600447
1517496720000000,20585941.862812

And in JSON:

demo@solaris-vbox:~$ sstore export -F json -t 2018-02-01T06:47:00 -e 2018-02-01T06:52:00 -i 60 '//:class.cpu//:stat.usage'
{
    "__version": 1,
    "data": [
        {
            "ssid": "//:class.cpu//:stat.usage",
            "records": [
                {
                    "start-time": 1517496420000000,
                    "value": 20286401.157722
                },
                {
                    "start-time": 1517496480000000,
                    "value": 20345863.706498999
                },
                {
                    "start-time": 1517496540000000,
                    "value": 20405363.144285999
                },
                {
                    "start-time": 1517496600000000,
                    "value": 20465694.085728999
                },
                {
                    "start-time": 1517496660000000,
                    "value": 20525877.600446999
                },
                {
                    "start-time": 1517496720000000,
                    "value": 20585941.862812001
                }
            ]
        }
    ]
}

Each of these formats has its own manual page: sstore.csv(5) and sstore.json(5).

Now the question arises: how do you get something interesting/useful? Well, part of this is learning what the StatsStore can gather for you and what tricks you can apply to the data before you export it. This is where the Dashboard is a great learning guide. When you first log in you get a landing page very similar to this:

Note: The default install of Oracle Solaris won't have a valid certificate, so the browser will complain about an untrusted connection. Because you know the system, you can add an exception and connect.

Because this post is not about exploring the Dashboard but about exporting data I'll just focus on that. But by all means click around.

So if you click on the "CPU Utilization by mode (%)" graph, you're essentially double-clicking on that data, and you'll go to a statistics sheet we've built showing all kinds of aspects of CPU utilization, which should look something like this:

Note: You can see my VirtualBox instance is pretty busy.

So these graphs look pretty interesting, but how do I get to this data? Well, if we're interested in the top processes, first click on "Top Processes by CPU Utilization", which should bring up this overlay window:

Note: This shows that this statistic is only collected temporarily (something you could make persistent here) and that the performance impact of collecting it is very low.

Now click on "proc cpu-percentage" and this will show what is being collected to create this graph:

This shows the SSID of the data in this graph. A quick look shows it's looking at the process data //:class.proc, then using a wildcard on the resources //:res.* to grab all the entries available, then selecting the statistic for CPU usage in percent //:stat.cpu-percentage, and finally performing a top operation on this list to select the top 5 processes // (see ssid-op(7) for more info). And when I use this on the command line I get:

demo@solaris-vbox:~$ sstore export -F CSV -t 2018-02-01T06:47:00 -i 60 '//:class.proc//:res.*//:stat.cpu-percentage//'
time,//:class.proc//:res.firefox/2035/demo//:stat.cpu-percentage//,//:class.proc//:res.rad/204/root//:stat.cpu-percentage//,//:class.proc//:res.gnome-shell/1316/demo//:stat.cpu-percentage//,//:class.proc//:res.Xorg/1039/root//:stat.cpu-percentage//,//:class.proc//:res.firefox/2030/demo//:stat.cpu-percentage//
1517496480000000,31.378174,14.608765,1.272583,0.500488,0.778198
1517496540000000,33.743286,8.999634,3.271484,1.477051,2.059937
1517496600000000,41.018677,9.545898,5.603027,3.170776,3.070068
1517496660000000,37.011719,8.312988,1.940918,0.958252,1.275635
1517496720000000,29.541016,8.514404,9.561157,4.693604,0.869751

Where "-F CSV" tells it to output to CSV (I could also have used lowercase csv), "-t 2018-02-01T06:47:00" is the begin time of what I want to look at, I'm not using an end time which would be similar but then with an "-e", the "-i 60" tells it I want the length of each sample to be 60 seconds, and then I use the SSID from above.

Note: For the CSV export to work you'll need to specify at least the begin time (-t) and the length of each sample (-i), otherwise the export will error out. You also need to ask for data the StatsStore has actually gathered, or the export will likewise fail.

In the response, the first line is the header describing what each column is (time, firefox, rad, gnome-shell, Xorg, firefox), followed by the values, where the first column is UNIX time.
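Those UNIX time values look like microseconds since the epoch, so if you want to sanity-check a timestamp while eyeballing the raw CSV, a quick one-liner converts it; on a machine in the same timezone as the demo system the first sample above should print as Thu Feb 1 06:48:00 2018:

$ perl -e 'print scalar localtime(1517496480000000 / 1e6), "\n"'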

Similarly if I look at what data is driving the CPU Utilization graph I get the following data with this SSID:

demo@solaris-vbox:~$ sstore export -F csv -t 2018-02-01T06:47:00 -i 60 '//:class.cpu//:stat.usage//:part.mode(user,kernel,stolen,intr)//:op.rate//:op.util'
time,//:class.cpu//:stat.usage//:part.mode(user,kernel,stolen,intr)//:op.rate//:op.util(intr),//:class.cpu//:stat.usage//:part.mode(user,kernel,stolen,intr)//:op.rate//:op.util(kernel),//:class.cpu//:stat.usage//:part.mode(user,kernel,stolen,intr)//:op.rate//:op.util(user)
1517496420000000,2.184663,28.283780,31.322588
1517496480000000,2.254090,16.524862,32.667445
1517496540000000,1.568696,19.479255,41.112911
1517496600000000,1.906700,18.194955,39.069998
1517496660000000,2.326821,18.103397,39.564789
1517496720000000,2.484758,17.909993,38.684371

Note: Even though we've asked for data on user, kernel, stolen, and intr (interrupts), it doesn't return data on stolen because it has none to report.

Also Note: It's using two other operations rate and util in combination to create this result (also see ssid-op(7) for more info).

This should allow you to click around the Dashboard and learn what you can gather and how to export it. We'll talk more about mining interesting data, and for example using the JSON output, in later posts.