
| January 03, 2018 | |
For VC5 features:
For shader_image_load_store, we (like some cases of Intel) have to do manual tiling address decode and format conversion based on top of SSBO accesses, and (like freedreno and some cases of Intel) want to use normal texture accesses for plain loads. I’ve started on a NIR pass that will split apart the tiling math from the format conversion math so that we can all hopefully share some code here. Some tests are already passing.
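The split that pass is after can be illustrated with some toy address math. Here is a hypothetical 4x4 microtile layout sketched in Python (not VC5's actual tiling format): the point is that the tiling step is pure integer arithmetic on coordinates, fully independent of the format conversion applied to the loaded bytes, so the two can live in separate, shareable passes.

```python
UTILE_W = 4   # hypothetical microtile width in texels
UTILE_H = 4   # hypothetical microtile height in texels

def tiled_byte_offset(x, y, stride_in_utiles, bpp):
    """Byte offset of texel (x, y) in a utile-tiled image.

    The tiling step (which utile, which texel within it) is pure
    integer math; converting the loaded bytes to the texel format
    is a separate, independent step -- which is why the two can be
    split into separate NIR passes.
    """
    utile_x, within_x = divmod(x, UTILE_W)
    utile_y, within_y = divmod(y, UTILE_H)
    texels_per_utile = UTILE_W * UTILE_H
    utile_index = utile_y * stride_in_utiles + utile_x
    within_index = within_y * UTILE_W + within_x
    return (utile_index * texels_per_utile + within_index) * bpp

# Texel (0, 0) is the first texel of the first utile:
assert tiled_byte_offset(0, 0, stride_in_utiles=8, bpp=4) == 0
# Texel (4, 0) starts the second utile: 16 texels * 4 bytes in.
assert tiled_byte_offset(4, 0, stride_in_utiles=8, bpp=4) == 64
```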
The compute shaders are interesting on this hardware. There’s no custom hardware support for them. Instead, you can emit a certain series of line primitives to the fragment shader such that you get shader instances spawned on all the QPUs in the right groups. It’s not pretty, but it means that the infrastructure is totally shared.
On the VC4 front, I got a chance to try out Boris’s performance counter work, using 3DMMES as a testcase. He found frameretrace really hard to work on, and so we don’t have a port of it yet (the fact that porting is necessary seems like a serious architectural problem). However, I was able to use things like “apitrace --pdrawcalls=GL_AMD_performance_monitor:QPU-total-clk-cycles-waiting-varyings” to poke at the workload.
There are easy ways to be led astray with performance counters on a tiling GPU: since we flush the frame at each draw call, the GPU spends most of its time loading/storing the framebuffer instead of running shaders, so idle clock cycles are misleading at the draw-call level. However, being able to look at things like cycles spent in the shaders of each draw call let us approximate the total time spent in each shader, to direct optimization work.
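That aggregation step can be sketched in a few lines of Python. The counter data here is made up for illustration (this is not the GL_AMD_performance_monitor query API itself, just the per-draw to per-shader roll-up described above):

```python
from collections import defaultdict

# Hypothetical per-draw samples: (shader_name, QPU clock cycles).
draw_samples = [
    ("terrain_fs", 120_000),
    ("ui_fs",       15_000),
    ("terrain_fs",  95_000),
    ("skybox_fs",   30_000),
]

def cycles_per_shader(samples):
    """Aggregate per-draw cycle counts into per-shader totals,
    sorted with the most expensive shader first."""
    totals = defaultdict(int)
    for shader, cycles in samples:
        totals[shader] += cycles
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranked = cycles_per_shader(draw_samples)
# The most expensive shader is the first optimization target.
assert ranked[0] == ("terrain_fs", 215_000)
```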
| December 29, 2017 | |
| December 19, 2017 | |

Since joining an enterprise (the world’s largest business-travel company) 6 months ago to drive their DevOps transformation, my ongoing mental evolution regarding the value of technology has gone through an almost religious rebirth. I now think in a completely different way than I did 10 years ago about what technology is important and when you need it. If you want to become a 10x engineer, you need a different perspective than just working on things because they seem cool. It’s about working toward the right outcomes, whereas most of us focus on the inputs (what tech you use, how many hours you work).
It all comes down to business value. You need to contribute to one of the core factors of business value, or however incredible the technology is, it just doesn’t make a difference. If you don’t know what that really means, you’re not alone — most of the technologists I know have trouble articulating the business model of their employers.
I think about it as 4 primary factors: cost, growth, risk, and speed.
Although many other factors have an impact upon business value, those are 4 of the most important ones that can make you consistently successful as a technologist. The key is to understand which ones play into your work, so you can act accordingly in your day-to-day efforts and as part of your career strategy. Are you building software for a cost center, a growth incubator, a risk center, or at a company that cares to invest in speed? Taking full advantage of this approach could make you the 10x engineer you’ve always wanted to be. Best of luck in your journey, and may you spend time where it matters!
Having spent 20 years of my life on Desktop Linux, I thought I should write up my thinking about why we haven’t yet had the Linux on the Desktop breakthrough, and maybe more importantly talk about the avenues I see for that breakthrough still happening. A lot has been written about this over the years, with different people coming up with their own explanations. My thesis is that there really isn’t one reason, but rather a range of issues that have all contributed to holding the Linux Desktop back from reaching a bigger market. To put this into context, success in my mind would be something like 10% market share of desktop systems; to me that means we have reached critical mass. So let me start by listing some of the main reasons I see for why we are not at that 10% mark today, before going on to talk about how I think that goal might still be reachable.
Things that have held us back
One of the most common explanations for why the Linux Desktop never caught on more is the fragmented state of the Linux Desktop space. We have a large host of desktop projects like GNOME, KDE, Enlightenment, Cinnamon etc., and an even larger host of distributions shipping these desktops. I used to think this state should get a lot of the blame, and I still believe it owns some of it, but I have also come to conclude in recent years that it is probably more of a symptom than a cause. If someone had come up with a model strong enough to let Desktop Linux break out of its current technical user niche, then I am now convinced that model would easily have also been strong enough to leave the Linux desktop fragmentation behind for all practical purposes, because at that point the alternative desktops for Linux would matter about as much as the alternative MS Windows shells do. So in summary, the fragmentation hasn’t helped for sure and is still not helpful, but it is probably a problem that has been overstated.
Another common item that has been pointed to is the lack of applications. In the early days of Desktop Linux, the challenge you always had when trying to convince anyone to move to Desktop Linux was that they almost invariably had one or more applications they relied on that were only available on Windows. I remember in one of my first jobs after university, when I worked as a sysadmin, we had a long list of these applications that various parts of the organization relied on, be that special tools to interface with a supplier or the bank, software dealing with nutritional values of food in the company cafeteria, etc. This problem has been in rapid decline for the last 5-10 years due to the move to web applications, though I am sure that in any given major organization you can still find a few of them. But between the move to the web and Wine, I don’t think this is a major issue anymore. So in summary, this was a major roadblock in the early years, but it is a lot less of an impediment these days.
Adopting a new platform is always easier if you can take the applications you are familiar with with you. So the lack of things like MS Office and Adobe Photoshop would always make a switch less likely, because in addition to switching OS you would also have to learn to use new tools. Along the same lines there was always the challenge of file format compatibility: in the early days in a hard sense, in that you simply couldn’t reliably load documents coming from some of these applications, and more recently in softer forms like the lack of metrically identical fonts. The font issue, for example, has been mostly resolved since Google released fonts metrically compatible with MS default fonts a few years ago, but it was definitely a hindrance to adoption for many years. The move to the web has greatly reduced this problem too, with organizations adopting things like Google Docs at a rapid pace these days. So in summary, once again something that used to be a big problem, but which is at least a lot less of a problem these days, though there are of course still apps not available for Linux that do stop people from adopting desktop Linux.
This is another item that many people have brought up over the years: unstable APIs. I think I have personally vacillated over the importance of this one multiple times. Changing APIs are definitely not a fun thing for developers to deal with; they add extra work, often without bringing direct benefit to their application. Linux packaging philosophy probably magnified this problem for developers, since anything that could be split out and packaged separately was, meaning that every application was always living on top of a lot of moving parts. That said, the reason I am sceptical of putting too much blame onto this is that you could always find stable subsets to rely on. For instance, if you targeted GTK2 or Qt back in the day and kept away from some of the faster-moving stuff offered by GNOME and KDE, you would not be hit with this that often. And of course if the Linux Desktop market share had been higher, then people would have been prepared to deal with these challenges regardless, just like they are on other platforms that keep changing and evolving quickly, like the mobile operating systems.
This might of course be the result of subjective memory, but one of the times when it felt like there could have been a Linux desktop breakthrough was when Linux on the server started making serious inroads. The old Unix workstation market was coming apart and moving to Linux already, worry about a Microsoft monopoly was at its peak, and Apple was in what seemed like mortal decline. There was a lot of media buzz around the Linux desktop, and VC-funded companies were set up to try to build a business around it. Reaching some kind of critical mass seemed like it could be within striking distance. Of course, what happened was that Steve Jobs returned to Apple and we suddenly had Mac OS X come onto the scene, taking at least some air out of the Linux Desktop space. The importance of this one I find exceptionally hard to quantify though; part of me feels it had a lot of impact, but on the other hand it isn’t 100% clear to me that the market and the players at the time would have been able to capitalize even if Apple had gone belly-up.
In the first 10 years of desktop Linux there was no doubt that Microsoft was working hard to nip any sign of Desktop Linux gaining a foothold or momentum in the bud. I remember, for instance, that Novell for quite some time tried to establish a serious Desktop Linux business after having bought Miguel de Icaza’s company Helix Code. However, a pattern quickly emerged: every time Novell or anyone else tried to announce a major Linux desktop deal, Microsoft came running in, offering next-to-free Windows licensing to get people to stay put. Looking at Linux migrations even seemed to become a go-to policy for negotiating better prices from Microsoft. So anyone wanting to attack the desktop market with Linux had to contend not only with market inertia, but with a general depression of the price of a desktop operating system, knowing that Microsoft would respond to any attempt to build momentum around Linux desktop deals with very aggressive sales efforts. This probably played an important part, as it meant that the pay-per-copy/subscription business model that Red Hat, for instance, built their server business around became really tough to make work in the desktop space: the price point ended up so low that it required gigantic volumes to become profitable, which is a hard thing to quickly achieve when fighting an entrenched market leader. So in summary, Microsoft in some sense successfully fended off Linux as a desktop competitor, although it could be said they did so at the cost of fatally wounding the per-copy-fee business model they built their company around, and ensured that the next wave of competitors they had to deal with, like iOS and Android, based themselves on business models where the cost of the OS was assumed to be zero, thus contributing to the Windows Phone efforts being doomed.
One of the big aspirations of the Linux community from the early days was the idea that an open source operating system would enable more people to afford running a computer and thus take part in the economic opportunities of the digital era. For the desktop space there was always this idea that while Microsoft was entrenched in North America and Europe, there was an ocean of people in the rest of the world who had never used a computer before and thus would be more open to adopting a desktop Linux system. I think this has so far panned out only to a limited degree: running a Linux distribution has surely opened job and financial opportunities for a lot of people, yet from a volume perspective most of these potential Linux users found that a pirated Windows copy suited their needs just as well or better. As an anecdote, there was recently a bit of noise and writing around the sudden influx of people on Steam playing PlayerUnknown’s Battlegrounds, as it caused the relative Linux market share to decline; most of these new players turned out to be running Windows in Chinese. Studies have found that about 70% of all software in China is unlicensed, so I don’t think I am going too far out on a limb in assuming that most of these gamers are not providing Microsoft with Windows licensing revenue, but it does illustrate the challenge of getting these people onto Linux when they are already getting an operating system for free. So in summary, in addition to facing cut-throat pricing from Microsoft in the business sector, one had to overcome the basically free price of pirated software in the consumer sector.
Few people probably remember or know this, but Red Hat was actually founded as a desktop Linux company. The first major investment in software development that Red Hat ever made was setting up the Red Hat Advanced Development Labs, hiring a bunch of core GNOME developers to move that effort forward. But when Red Hat pivoted to the server with the introduction of Red Hat Enterprise Linux, the desktop quickly started playing second fiddle. And before I proceed: all these events were many years before I joined the company, so just as with my other points here, read this as an analysis from someone without first-hand knowledge. While Red Hat has always offered a desktop product and has always been a major contributor to keeping the Linux desktop ecosystem viable, Red Hat was focused on server-side solutions, and the desktop offering was always aimed more narrowly at things like technical workstation customers and people developing towards the RHEL server. It is hard to say how big an impact Red Hat’s decision not to go after this market has had; on one side it would probably have been beneficial to have the Linux company with the deepest pockets and the strongest brand be a more active participant, but on the other hand, staying mostly out of the fight gave other companies more room to give it a go.
This bullet point is probably going to be somewhat controversial considering I work for Red Hat (although this is my private blog with my own personal opinions), but one cannot talk about the trajectory of the Linux Desktop over the last decade without mentioning Canonical and Ubuntu. I have to assume that when Mark Shuttleworth was mulling over doing Ubuntu, he saw a lot of the challenges I mention above, especially the revenue generation challenges that the competition from Microsoft created. So in the end he decided on the standard internet business model of the time, which was to quickly build up a huge userbase and deal with how to monetize it later. Ubuntu was thus launched with an effective price point of zero; in fact, you could even get install media sent to you for free. The effort worked in the sense that Ubuntu quickly became the biggest player in the Linux desktop space, and it certainly helped the Linux desktop market share grow in the early years. Unfortunately I think it still basically failed, in that it didn’t manage to grow big enough to provide Ubuntu with enough revenue, through their app store or their partner agreements, to allow them to seriously re-invest in the Linux Desktop and fund the kind of marketing effort needed to take Linux to a less super-technical audience. So once it plateaued, what they had was enough revenue to keep a relatively barebones engineering effort going, but not the kind of income that would allow them to steadily build the Linux Desktop market further. Mark then tried to capitalize on the mindshare and market share he had managed to build by branching out into efforts like their TV and phone products, but all those efforts eventually failed.
It would probably take an article in itself to deeply discuss why the grow-the-userbase strategy failed here versus why, for instance, Android succeeded with the same model, but I think the short version goes back to the fact that you had an entrenched market leader, and the Linux Desktop isn’t different enough from a Mac or Windows desktop to drive the type of market change that the transition from feature phones to smartphones was.
And to be clear, I am not criticizing Mark for the strategy he chose; if I were in his shoes back when he started Ubuntu, I am not sure I would have been able to come up with a different strategy that could plausibly have succeeded from his starting point. That said, it did contribute to pushing the expected price of desktop Linux even further down, thus making it even harder for anyone to generate significant revenue from desktop Linux. On the other hand, one can argue that this would likely have happened anyway due to competitive pressure and Windows piracy. Canonical’s recent pivot away from the desktop towards trying to build a business in the server and IoT space is in some sense a natural consequence of hitting the desktop growth plateau without enough revenue to invest in further growth.
So in summary, what was once seen as the most likely contender to take the Linux Desktop to critical mass turned out to have taken off with too little rocket fuel, and eventually gravity caught up with them. And what we can never know for sure is whether, during this run, they sucked so much air out of the market that it kept someone who could have taken us further with a different business model from jumping in.
This one is a bit of a chicken-and-egg issue. Yes, lack of (perfect) hardware support has for sure held Linux back on the desktop, but lack of market share has also held hardware support back. As with any system, this is a question of reaching critical mass despite your challenges, and thus eventually being so big that nobody can afford to ignore you. This is an area where even today we are still not fully there, but where I do feel we are getting closer all the time. When I installed Linux for the very first time, which I think was Red Hat Linux 3.1 (pre-RHEL days), I spent about a weekend fiddling just to get my sound card working; I think I had to grab an experimental driver from somewhere and compile it myself. These days I mostly expect everything to work out of the box, except more unusual hardware like ambient light sensors or fingerprint readers, but even support for such devices is starting to land, and thanks to efforts from vendors such as Dell things are looking pretty good here. But the memory of these issues is long, so a lot of people, especially those not using Linux themselves but who have heard about it, still assume hardware support is very much a hit-or-miss affair.
Anyone who has read my blog posts probably knows I am an optimist by nature. This isn’t just some kind of genetic disposition towards optimism, but also a philosophical belief that optimism breeds opportunity while pessimism breeds failure. Just because we haven’t gotten the Linux Desktop to 10% market share so far doesn’t mean it will not happen going forward; it just means we haven’t achieved it yet. One of the key characteristics of open source is that it is incredibly hard to kill: unlike with proprietary software, just because a company goes out of business or shuts down a part of its business, the software doesn’t go away or stop getting developed. As long as there is a strong community interested in pushing it forward, it remains and evolves, and thus when opportunity comes knocking again it is ready to try again. And that is definitely true of Desktop Linux, which from a technical perspective is better than it has ever been: the level of polish is higher than ever before, hardware support is better than ever before, and the range of software available is better than ever before.
And the important thing to remember here is that we don’t exist in a vacuum; the world around us constantly changes too, which means that the things or companies that blocked us in the past might not be around, or able to block us, tomorrow. Apple and Microsoft are very different companies today than they were 10 or 20 years ago, and their focus and who they compete with are very different. The dynamics of the desktop software market are changing with new technologies and paradigms all the time, like how online media consumption has moved from the laptop to phones and tablets. Five years ago I would have considered iTunes a big competitive problem; today the move to streaming services like Spotify, Hulu, Amazon or Netflix has made iTunes feel archaic, a symbol of bygone times.
And many of the problems we faced before, like weird Windows applications without a Linux counterpart, have been washed away by the switch to browser-based applications. And while Valve’s SteamOS effort didn’t take off, it has provided Linux users with access to a huge catalog of games, removing a reason that I know caused a few of my friends to mostly abandon using Linux on their computers. And as a consumer you can now actually buy Linux machines from a range of vendors who try to properly support Linux on their hardware, including a major player like Dell and smaller outfits like System76 and Purism.
And since I work for Red Hat managing our Desktop Engineering team, I should address the question of whether Red Hat will be a major driver in taking desktop Linux to that 10%. Well, Red Hat will continue to support and evolve our current RHEL Workstation product, and we are seeing a steady growth of new customers for it. So if you are looking for a solid developer workstation for your company you should absolutely talk to Red Hat sales about RHEL Workstation, but Red Hat is not looking at aggressively targeting general consumer computers anytime soon. Caveat: I am not a C-level executive at Red Hat, so I guess there is always a chance Jim Whitehurst or someone else in the top brass is mulling over a gigantic new desktop effort and I simply don’t know about it, but I don’t think it is likely and thus would not advise anyone to hold their breath waiting for such a thing to be announced :). That said, Red Hat, like any company, does react to market opportunities as they arise, so who knows what will happen down the road. And we will definitely keep pushing Fedora Workstation forward as the place to experience the leading edge of the Desktop Linux experience and a great portal into the world of Linux on servers and in the cloud.
So to summarize: there are a lot of things happening in the market that could provide the right set of people the opportunity they need to finally take Linux to critical mass. Whether anyone has the timing and skills to pull it off is of course always an open question, and it is a question which will only be answered the day someone does it. The only thing I am sure of is that the Linux community is providing a stronger technical foundation for someone to succeed with than ever before, so the question is just whether someone can come up with the business model and the marketing skills to take it to the next level. There is also the chance that it will come in a shape we don’t appreciate today: maybe ChromeOS evolves into a more full-fledged operating system as it grows in popularity and thus ends up being the Linux-on-the-desktop endgame? Or maybe Valve relaunches their SteamOS effort and it provides the foundation for major general desktop growth? Or maybe market opportunities arise that cause us at Red Hat to go after the desktop market in a wider sense than we do today? Or maybe Endless succeeds with their vision for a Linux desktop operating system? Or maybe the idea of a desktop operating system gets supplanted to the degree that in the end we just sit there saying ‘Alexa, please open the IDE and take dictation of this new graphics driver I am writing’ (ok, probably not that last one ;)
And to be fair, there are a lot of people saying that Linux already made it on the desktop in the form of things like Android tablets. That is technically correct, since Android runs on the Linux kernel, but I think for many of us it feels more like a distant cousin than a close family member, both in terms of the use cases it targets and in terms of technological pedigree.
As a sidenote, I am heading off on Yuletide vacation tomorrow evening, taking my wife and kids to Norway to spend time with our family there. So don’t expect a lot of new blog posts from me until I am back from DevConf in early February. I hope to see many of you at DevConf though; it is a great conference and Brno is a great town even in freezing winter. As we say in Norway, there is no such thing as bad weather, only bad clothing.
| December 18, 2017 | |
I’m sad to say it’s the end of the road for me with Gentoo, after 13 years volunteering my time (my “anniversary” is tomorrow). My time and motivation to commit to Gentoo have steadily declined over the past couple of years and eventually stopped entirely. It was an enormous part of my life for more than a decade, and I’m very grateful to everyone I’ve worked with over the years.
My last major involvement was running our participation in the Google Summer of Code, which is now fully handed off to others. Prior to that, I was involved in many things from migrating our X11 packages through the Big Modularization and maintaining nearly 400 packages to serving 6 terms on the council and as desktop manager in the pre-council days. I spent a long time trying to change and modernize our distro and culture. Some parts worked better than others, but the inertia I had to fight along the way was enormous.
No doubt I’ve got some packages floating around that need reassignment, and my retirement bug is already in progress.
Thanks, folks. You can reach me by email using my nick at this domain, or on Twitter, if you’d like to keep in touch.
| December 15, 2017 | |
So I spent a few hours polishing my crystal ball today, so here are some predictions for Linux on the Desktop in 2018. The advantage of course for me to publish these now is that I can then later selectively quote the ones I got right to prove my brilliance and the internet can selectively quote the ones I got wrong to prove my stupidity :)
Prediction 1: Meson becomes the de facto build system of the Linux community
Meson has been going from strength to strength this year, and a lot of projects which passed on earlier attempts to replace autotools have adopted it. I predict this trend will continue in 2018 and that by the end of the year everyone will agree that Meson has replaced autotools as the Linux community’s build system of choice. That said, I am not convinced the Linux kernel itself will adopt Meson in 2018.
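For anyone who hasn’t tried it, a complete Meson build description for a small C program really is just a couple of lines, which is a big part of its appeal over autotools (project and file names here are placeholders):

```meson
project('hello', 'c', version : '0.1')
executable('hello', 'hello.c')
```

Configuring and building is then just `meson builddir && ninja -C builddir`.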
Prediction 2: Rust puts itself on a clear trajectory to replace C and C++ for low level programming
Another rising star of 2017 is the programming language Rust. And while its pace of adoption will be slower than Meson’s, I do believe that by the time 2018 comes to a close the general opinion will be that Rust is the future of low-level programming, replacing old favorites like C and C++. Major projects like GNOME and GStreamer are already adopting Rust at a rapid pace, and I believe even more projects will join them in 2018.
Prediction 3: Apple’s decline as a PC vendor becomes obvious
Ever since Steve Jobs died, it has become quite clear in my opinion that the emphasis on the traditional desktop is fading at Apple. The pace of hardware refreshes seems to be slowing and Mac OS X seems to be going more and more stale. Some pundits have already started pointing this out, and I predict that in 2018 Apple will no longer be considered the cool kid on the block for people looking for laptops, especially among the tech-savvy crowd. Hopefully that is a good opportunity for Linux on the desktop to assert itself more.
Prediction 4: Traditional distro packaging for desktop applications
will start fading away in favour of Flatpak
From where I am standing, I think 2018 will be the breakout year for Flatpak as a replacement for getting your desktop applications as RPMs or debs. I predict that by the end of 2018 more or less every Linux desktop user will be running at least one Flatpak on their system.
Prediction 5: Linux Graphics competitive across the board
I think 2018 will be a breakout year for Linux graphics support. I think our GPU drivers and APIs will be competitive with any other platform, both in completeness and performance. So by the end of 2018 I predict that you will see Linux game ports by major porting houses like Aspyr and Feral that perform just as well as their Windows counterparts. What is more, I also predict that by the end of 2018 discrete graphics will be considered a solved problem on Linux.
Prediction 6: H265 will be considered a failure
I predict that by the end of 2018 H265 will be considered a failed codec effort, and the era of royalty-bearing media codecs will effectively start coming to an end. H264 will be considered the last successful royalty-bearing codec, and all new codecs coming out will be open source and royalty free.
| December 11, 2017 | |
It’s been a while since I posted a TWIV update, so this one will be big:
For VC5 GL features:
While running DEQP tests on all this (which unfortunately don’t complete yet due to running out of memory on my 7268 without swap), I’ve also rebased my Vulkan series and started on implementing image layout for it.
I also tested Timothy Arceri’s gallium NIR linking pass. The goal of that is to pack and dead-code eliminate varyings up in shared code. It’s a net ~0 effect on vc4 currently, but it will help vc5, and I may be able to dead-code eliminate some of the vc4 compiler backend now that the IR coming in to the driver is cleaner.
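The packing half of that linking work can be illustrated with a toy sketch (hypothetical scalar varyings, not the actual gallium pass): dead components are dropped and the survivors are packed densely into vec4 slots, reducing how many slots the hardware has to interpolate.

```python
def pack_varyings(varyings, used):
    """Drop unused scalar varyings and pack the rest into vec4 slots.

    Returns a mapping from varying name to (slot, component).
    Fewer occupied slots means less interpolation work, which is
    the win a varying-packing linking pass is after.
    """
    live = [v for v in varyings if v in used]
    return {v: (i // 4, i % 4) for i, v in enumerate(live)}

varyings = ["color_r", "color_g", "color_b", "fog", "unused_a", "uv_u", "uv_v"]
used = {"color_r", "color_g", "color_b", "uv_u", "uv_v"}

layout = pack_varyings(varyings, used)
# Five live scalars fit in two vec4 slots instead of the original three.
assert layout["uv_v"] == (1, 0)
assert max(slot for slot, _ in layout.values()) == 1
```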
On the VC4 front, Boris has posted a series for performance counter support. This was a pretty big piece of work, and our hope is that with the addition of performance counters we’ll be able to dig into those workloads where vc4 is slower than the closed driver and actually fix them. Unfortunately he hasn’t managed to build frameretrace yet, so we haven’t really tested it on its final intended workload.
For VC4 GL, I did a bit of work on minetest performance, improving the game’s fps from around 15 to around 17. Its desktop GL renderer is really unfortunate, using a lot of immediate-mode GL, but I was completely unable to get its GLES renderer branch to build. It also lacks a reproducible/scriptable benchmark mode, so most of my testing was against an apitrace, which is very hard to get useful performance data from.
I debugged a crash in vc4 with large vertex counts that a user had reported, landed a fix for a kernel memory leak, and landed Dave Stevenson’s HVS format support (part of his work on getting video decode into vc4 GL).
Finally, I did a bit of research and work to help unblock Dave Stevenson’s unicam driver (the open source camera driver). Now that we have an ack for the DT binding, we should be able to get it merged for 4.16!
| December 06, 2017 | |
| November 28, 2017 | |

So let's start off by covering how ChromiumOS relates to ChromeOS. The ChromiumOS project is essentially ChromeOS minus branding and some packages for things like the media digital restrictions management.
But on the whole, almost everything is there, and the pieces that aren't, you don't need.
In order to check out ChromiumOS and other large Google projects, you'll need depot tools.
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
export PATH=$PATH:$(PWD)/depot_tools
You may want to add the PATH export to your .bashrc.
mkdir chromiumos
cd chromiumos
repo init -u https://chromium.googlesource.com/chromiumos/manifest.git --repo-url https://chromium.googlesource.com/external/repo.git [-g minilayout]
repo sync -j75
cros_sdk
export BOARD=amd64-generic
./setup_board --board …
| November 20, 2017 | |
Another series of VC5 GL features this week:
For VC4, I reviewed and landed a bugfix that would cause kernel oopses in the IRQ handler path for the out-of-memory signal. I think this covered the only known oops in VC4’s 3D.
I also spent a while on the VC4 backport, debugging a regression related to the DSI changes: now the VC4 driver won’t load when the DSI panel is present in the device tree overlay but not actually connected. Unfortunately, there aren’t really good solutions for this, because in the ARM DT world the assumption is that your hardware is fixed: you can’t just optionally plug hardware in without doing a bunch of manual editing of your DT. I’m working with the DRM bridge maintainers to come up with a plan.
| November 18, 2017 | |
Looking over my blog recently, I realized I never did a post for the Solaris 11.3 GA release to list the bundled software updates, as I’d previously done for the Solaris 11.1, Solaris 11.2 beta, Solaris 11.2 GA, and Solaris 11.3 beta releases. But that was two years ago, so telling you now what we shipped then would be boring. Instead, I've put together a list of what's changed in the Solaris support repository since then.
When we shipped the 11.3 beta, James announced that Oracle Instant Client 12.1 packages were now in the IPS repo for building software that connects to Oracle databases. He's now added Oracle Instant Client 12.2 packages to the repo as well, with the old packages renamed to allow installing both versions.
While there's plenty of updates and additions in this package list, there's also a good number of older packages removed, especially those which were no longer being supported by the upstream community. While the End of Features Notices page gives a heads up for what's coming out in some future release, the SRU readmes also have a section listing things scheduled to be obsoleted soon in the support repository. For instance, the SRU 26 Read Me announces that the removal of Tomcat 6.0 has been completed, and lists the packages that will be removed soon.
This table shows most of the changes to the bundled packages between the original Solaris 11.3 release and the latest Solaris 11.3 support repository update (SRU26, aka 11.3.26, released November 15, 2017). These show the versions they were released with, and not later versions that may also be available via the FOSS Evaluation Packages for existing Solaris releases.
As with previous posts, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn’t change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.
Package | Upstream | 11.3 | 11.3 SRU 26
archiver/gnu-tar | GNU tar | 1.27.1 | 1.28
archiver/unrar | RARLAB | 4.2.4 | 5.5.5
benchmark/gtkperf | GtkPerf | 0.40 | not included
cloud/openstack/cinder | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/glance | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/heat | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/horizon | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/ironic | OpenStack | 0.2014.2.1 | 0.2015.1.2
cloud/openstack/keystone | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/neutron | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/nova | OpenStack | 0.2014.2.2 | 0.2015.1.2
cloud/openstack/swift | OpenStack | 2.2.2 | 2.3.0
compress/p7zip | p7zip | 9.20.1 | 15.14.1
compress/pixz | Dave Vasilevsky | 1.0 | 1.0.6
compress/xz | xz | 5.0.1 | 5.2.3
crypto/gnupg | GnuPG | 2.0.27 | 2.0.30
database/mysql-55 | MySQL | 5.5.43 | 5.5.57
database/mysql-56 | MySQL | 5.6.21 | 5.6.37
database/mysql-57 | MySQL | not included | 5.7.17
database/oracle/instantclient-122 | Oracle Instant Client | not included | 12.1.0.2.0
database/sqlite-3 | | 3.8.8.1 | 3.17.0
desktop/project-management/openproj | OpenProj | 1.4 | not included
desktop/project-management/planner | GNOME Planner | 0.14.4 | not included
desktop/studio/jokosher | Jokosher | 0.11.5 | not included
desktop/system-monitor/gkrellm | GKrellM | 2.3.4 | not included
developer/astdev93 | AT&T Software Technology (AST) | 93.21.0.20110208 |
| November 14, 2017 | |
Another series of VC5 GL features this week:
For VC4, the big news is that we’ve landed Boris’s MADVISE support in Mesa as well now. This means that if you have a 4.15 kernel and the next release of Mesa, the kernel will now be able to clean up the userspace BO cache when we run out of CMA. This doesn’t prevent all GL_OUT_OF_MEMORY errors, but it should reduce the circumstances where you can hit them.
I spent a while putting together a backport of all our kernel development for Raspbian’s rpi-4.9.y branch. So much has happened in DRM in the last year, that it’s getting harder and harder to backport our work. However, the PR I sent brings in fully functional support for the DSI panel (no more purple flickering!) and fix for a longstanding race that could crash the kernel when powering down the GPU (thanks to Stefan Schake for debugging and providing a patch!)
I also fixed the VC4 build and armhf cross-builds with the new meson build system, after Timothy Arceri noted that it wasn’t working on his Pi. I’m now happily using meson for all of my Mesa development.
| November 07, 2017 | |
| November 06, 2017 | |
Build the kernel:
$ cd ~/src
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git checkout v4.12
$ make x86_64_defconfig
$ make bzImage
$ cd ..
Build minijail:
$ git clone https://android.googlesource.com/platform/external/minijail
$ cd minijail
$ make
$ cd ..
Build crosvm:
$ git clone https://chromium.googlesource.com/a/chromiumos/platform/crosvm
$ cd crosvm
$ LIBRARY_PATH=~/src/minijail cargo build
Generate rootfs:
$ cd ~/src/crosvm
$ dd if=/dev/zero of=rootfs.ext4 bs=1K count=1M
$ mkfs.ext4 rootfs.ext4
$ mkdir rootfs/
$ sudo mount rootfs.ext4 rootfs/
$ sudo debootstrap testing rootfs/
$ sudo umount rootfs/
Run crosvm:
$ LD_LIBRARY_PATH=~/src/minijail ./target/debug/crosvm run -r rootfs.ext4 --seccomp-policy-dir=./seccomp/x86_64/ ~/src/linux/arch/x86/boot/compressed/vmlinux.bin
The work ahead includes figuring out the best way for Wayland clients in the guest to interact with the compositor in the host, and also for guests to make efficient use of the GPU.
I spent the last week mostly working on VC5 GL features and bugfixes again.
However, most of my time was actually spent on trying to track down my remaining GPU hangs. Not many tests hang, but fbo-generatemipmaps is one, and it’s a big component of piglit’s coverage. So far, I’ve definitely figured out that the hanging RCL has run to completion without error, but without the bit in the interrupt status being set (which is supposed to be set by the final command of the RCL). I tracked down that I had my DT wrong and was treating VC5 as edge-triggered rather than level-triggered, but that doesn’t seem to have helped.
On the VC4 front, I’ve been talking to someone trying to rig up other DSI panels to the RPi. It got me looking at my implementation again, and I finally found why my DSI transactions weren’t working: I was emitting the wrong type of transactions for the bridge! Switching from a DCS write to a generic write, the panel comes up fine and the flickering is gone.
| October 30, 2017 | |
I spent the last week entirely working on VC5 GL features and bugfixes.
| October 27, 2017 | |
The firmware blob that is needed by Broadcom devices is not shipped by default, and has to be installed manually.
Download BCM-0a5c-6410.hcd and copy it into /lib/firmware/brcm/ and then reboot your device.
wget https://memcpy.io/files/2017-10-28/BCM-0a5c-6410.hcd
sudo cp BCM-0a5c-6410.hcd /lib/firmware/brcm/
sudo chmod 0644 /lib/firmware/brcm/BCM-0a5c-6410.hcd
sudo reboot
| October 26, 2017 | |
libwacom has been around since 2011 now but I'm still getting the odd question or surprise at what libwacom does, is, or should be. So here's a short summary:
libwacom is a library that provides tablet descriptions but no actual tablet event handling functionality. Simply said, it's a library that provides access to a bunch of text files. Graphics tablets are complex, and to integrate them well we usually need to know more about them than the information the kernel reports. If you need to know whether the tablet is a standalone one (Wacom Intuos series) or a built-in one (Wacom Cintiq series), libwacom will tell you that. You need to know how many LEDs and mode groups a tablet has? libwacom will tell you that. You need an SVG to draw a representation of the tablet's button layout? libwacom will give you that. You need to know which stylus is compatible with your tablet? libwacom knows about that too.
But that's all it does. You cannot feed events to libwacom, and it will not initialise the device for you. It just provides static device descriptions.
If your tablet isn't working or the buttons aren't handled correctly, or the stylus is moving the wrong way, libwacom won't be able to help with that. As said above, it merely provides extra information about the device but is otherwise completely ignorant of the actual tablet.
Sure, it's named after Wacom tablets because that's where the majority of effort goes (not least because Wacom employs Linux developers!). But the description format is independent of the brand so you can add non-Wacom tablets to it too.
Caveat: many of the cheap non-Wacom tablets re-use USB ids so two completely different devices would have the same USB ID, making a static device description useless.
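To illustrate what that matching is based on: the key libwacom uses for a device description is the USB vendor:product pair, i.e. the ID field lsusb prints. A minimal sketch, with the lsusb line hard-coded (the device line shown is just an example, not taken from the post) so the snippet is self-contained:

```shell
# Hypothetical illustration: extract the vendor:product ID that a tablet
# description is keyed on from a captured lsusb line. The sample line is
# hard-coded here so the snippet runs anywhere.
line='Bus 001 Device 004: ID 056a:00bb Wacom Co., Ltd Intuos4 6x9'
usb_id=$(printf '%s\n' "$line" | sed -n 's/.*ID \([0-9a-f]*:[0-9a-f]*\).*/\1/p')
printf '%s\n' "$usb_id"
```

Two completely different cheap tablets reporting the same vendor:product pair would therefore map to the same (possibly wrong) static description, which is exactly the caveat above.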
Right now, the two most prevalent users of libwacom are GNOME and libinput. GNOME's control center and mutter use libwacom for tablet-to-screen mappings as well as to show the various stylus capabilities. And it uses the SVG to draw an overlay for pad buttons. libinput uses it to associate the LEDs on the pad with the right buttons and to initialise the stylus tools axes correctly. The kernel always exposes all possible axes on the event node but not all styli have all axes. With libwacom, we can initialise the stylus tool based on the correct information.
So now I expect you to say something like "Oh wow, I'm like totally excited about libwacom now and I want to know more and get involved!". Well, fear not, there is more information and links to the repos in the wiki.
| October 24, 2017 | |

A recording of the talk can be found here.
If you're curious about the slides, you can download the PDF or the ODP.
This post has been a part of work undertaken by my employer Collabora.
I would like to thank the wonderful organizers of Embedded Linux Conference EU for hosting a great community event.
| October 23, 2017 | |
The GNOME.Asia Summit 2017 organizers invited me to speak at their conference in Chongqing/China, and it was an excellent event! Here's my brief report:

Because we arrived one day early in Chongqing, my GNOME friends Sri, Matthias, Jonathan, David and I started our journey with an excursion to the Dazu Rock Carvings, a short bus trip from Chongqing, and an excellent (and sometimes quite surprising) sight. I mean, where else can you see a centuries-old buddha with 1000+ hands holding a Nexus 5 cell phone? Here's proof:
The GNOME.Asia schedule was excellent, with various good talks, including some about Flatpak, Endless OS, rpm-ostree, Blockchains and more. My own talk was about The Path to a Fully Protected GNOME Desktop OS Image (Slides available here). In the hallway track I did my best to advocate casync to whoever was willing to listen, and I think enough were ;-). As we all know attending conferences is at least as much about the hallway track as about the talks, and GNOME.Asia was a fantastic way to meet the Chinese GNOME and Open Source communities.
The day after the conference the organizers of GNOME.Asia organized a Chongqing day trip. A particular highlight was the ubiquitous hot pot, sometimes with the local speciality: fresh pig brain.
Here are some random photos from the trip: sights, food, social event and more.
I'd like to thank the GNOME Foundation for funding my trip to GNOME.Asia. And that's all for now. But let me close with an old Chinese wisdom:
The Trials Of A Long Journey Always Feeling, Civilized Travel Pass Reputation.
For those living under a rock, the videos from everybody's favourite Userspace Linux Conference All Systems Go! 2017 are now available online.
The videos for my own two talks are available here:
Synchronizing Images with casync (Slides)
Containers without a Container Manager, with systemd (Slides)
Of course, this is the stellar work of the CCC VOC folks, who are hard to beat when it comes to videotaping of community conferences.
This week I mostly worked on the 7268, getting the GL driver stabilized on the actual HW.
First, I implemented basic overflow memory allocation, so we wouldn’t just hang and reset when that condition triggers. This let me complete an entire piglit run on the HW, which is a big milestone.
I also ended up debugging why the GPU reset on overflow wasn’t working before – when we reset we would put the current bin job onto the tail of the binner job list, so we’d just try it again later and hang again if that was the bad job. The intent of the original code had been to move it to the “done” list so it would get cleaned up without executing any more. However, that was also a problem – you’d end up behind by one in your seqnos completed, so BO idle would never work. Instead, I now just move it to the next stage of the execution pipeline with a “hung” flag to say “don’t actually execute anything from this”. This is a bug in vc4 as well, and I need to backport the fix.
Once I had reliable piglit, I found that there was an alignment requirement for default vertex attributes that I hadn’t uncovered in the simulator. By moving them to CSO time, I reduced the draw overhead in the driver and implicitly got the buffer aligned like we needed.
Additionally, I had implemented discards using conditional tile buffer writes, same as vc4. This worked fine on the simulator, but had no effect on the HW. Replacing that with SETMSF usage made the discards work, and probably reduced instruction count.
On the vc4 front, I merged the MADVISE code from Boris just in time for the last pull request for 4.15. I also got in the DSI transactions sleeping fix, so the code should now be ready for people to try hooking up random DSI panels to vc4.
| October 19, 2017 | |
So I have over the last few years blogged regularly about upcoming features in Fedora Workstation. As we put the finishing touches on Fedora Workstation 27, I thought I should try to look back at everything we have achieved since Fedora Workstation was launched with Fedora 21. The efforts I highlight here are efforts where we did significant or most of the development. There are of course a lot of other big changes over the last few years, made by the wider community, that we leveraged and offer in Fedora Workstation; examples include things like Meson and Rust. This post is not about those, but that said, I do want to write a post at some point just talking about the achievements of the wider community, because they are very important and crucial too. Along the same lines, this post will not speak about the large number of improvements and bugfixes that we contributed to a long list of projects, like GNOME itself. This blog post is about taking stock and taking some pride in what we have achieved so far, and the major hurdles we passed on our way to improving the Linux desktop experience.
This blog is also slightly different from my normal format as I will not call out individual developers by name as I usually do, instead I will focus on this being a totality and thus just say ‘we’.
I am sure I missed something, but this is at least a decent list of Fedora Workstation highlights for the last few years. Next onto working on my Fedora Workstation 27 blogpost :)
| October 18, 2017 | |
Alberto Ruiz just announced Fleet Commander as production ready! Fleet Commander is our new tool for managing large deployments of Fedora Workstation and RHEL desktop systems. So head over to Alberto's Fleet Commander blog post for all the details.
Something went incredibly right, and review feedback poured in last week and I got to merge a lot of code.
My VC5 GL driver’s patches for core Mesa got reviewed (thanks Rob Clark, Adam Jackson, and Emil Velikov), so I got to merge it to Mesa. It’s so nice to finally be able to work in tree instead of on a rebasing branch that breaks most weeks.
My GL_OES_required_internalformat got reviewed by Nicolai Hähnle, so I gave it another test run on the Intel CI farm (thanks, Mark Janes!) and merged. VC4 and VC5 now have proper 5551 texture format support, and VC4 conformance test failures with 565 are fixed.
My GL_MESA_tile_raster_order extension for overlapping blit support on VC4 got merged to Khronos’s git tree. Nicolai reviewed my Mesa implementation of the extension, so I’ve merged it. All that’s left for that is merging the X Server usage of it and pushing it on downstream to Raspbian.
I tested the fast mutex patch series for Mesa, and found a 4.3% (+/- .9%) improvement in 10x10 copywinwin on my Intel hardware. Hopefully this lands soon, since those performance improvements should show up on ARM as well.
On the VC5 front, I fixed VPM setup on actual HW (the simulator’s restrictions didn’t catch one of the HW requirements), getting a lot of tests that do gl_ModelViewProjectionMatrix * gl_Vertex to work. I played around with the new GPU reset code a bit, and it looks like the next step is to implement binner overflow handling.
I’ve been doing some more review feedback with Boris. We’re getting closer to merge on MADVISE, for sure. I respun my DSI transactions fix based on Boris’s feedback, and it’s even nicer now.
Next week: VC5 binner overflow handling, merging MADVISE, and hopefully putting together some Raspbian backports.
| October 12, 2017 | |
So I am really happy to announce another major codec addition to Fedora Workstation 27, namely the codec called AAC. As you might have seen from Tom Callaway's announcement, this has just been cleared for inclusion in Fedora.
For those not well versed in the arcane lore of audio codecs, AAC is the codec used by things like iTunes and is found in a lot of general media files online. AAC stands for Advanced Audio Coding and was created by the MPEG working group as the successor to mp3. Especially due to Apple embracing the format, there are a lot of files out there using it, and thus we wanted to support it in Fedora too.
What we will be shipping in Fedora is a modified version of the AAC implementation released by Google, which was originally written by Fraunhofer. On top of that we will of course be providing GStreamer plugins to enable full support for playing and creating AAC files for GStreamer applications.
Be aware though that AAC is a bit of an umbrella term for a lot of different technologies, and thus you might come across files that claim to use AAC but which we can not play back. The most likely reason for that would be that they require an AAC profile we do not support. The version of AAC that we will be shipping has also been carefully created to fit within the requirements for software in Fedora, so if you are a packager, be aware that unlike with for instance mp3, this change does not mean you can package and ship any AAC implementation you want in Fedora.
I am expecting to have more major codec announcements soon, so stay tuned :)
| October 10, 2017 | |
I spent this week in front of the VC5 hardware, working toward implementing GPU reset. I’m going to need reliable reset before I can start running GL testsuites.
Of course, to believe that GPU reset works, I’m going to want some tests that trigger it. I pulled the VC5 XML-based code-generation from Mesa into i-g-t, and built up basic rendering tests using it. It caught that I was misusing struct scatterlist (causing CPU page faults trying to fill the VC5 MMU). I also had mistyped a bit of the XML, calling a bitmask a bool so that the hardware tried to store render targets 1, 2, and 3, instead of 0 (causing GPU hangs). After stabilizing all that, building the hang testcase was pretty simple.
Taking a break from kernel adventures, I did a bit more work on the vc5 GL driver. Transform feedback now has many tests passing, provoking vertex is supported, float32 render targets are supported, MRTs are supported, colormasks are fixed for non-independent blending, a regression in blending with a 0 factor is fixed, and >32bpp clear colors are fixed.
I’ve got a new revision of Boris’s VC4 MADVISE work, and it’s looking good. Boris has also cleaned up some debug messages that have been spamming dmesg on the Raspberry Pi, which is great news for us.
I also spent quite some time reviewing Dylan’s meson work for Mesa. It’s a slog – build systems for a project of Mesa’s scale are huge, but I’ve seen what meson can get you in productivity gains from my conversion of the X Server, and it looks like we should be able to replace all 3(!) of Mesa’s build systems with this one.
Finally, on Friday I got the reviews necessary for the DSI panel driver, and I merged it to drm-misc-next to appear in the 4.15 kernel. We still need to figure out what to do about the devicetree for it (since it’s sort of an optional piece of hardware for the board, but more official than other hardware), but at least this is a lot less for downstreams to carry.
Next week: merging vc5 GL and working on actually performing GPU reset.
| October 08, 2017 | |
TL;DR: systemd now can do per-service IP traffic accounting, as well as access control for IP address ranges.
Last Friday we released systemd 235. I already blogged about its Dynamic User feature in detail, but there's one more piece of new functionality that I think deserves special attention: IP accounting and access control.
Before v235 systemd already provided per-unit resource management hooks for a number of different kinds of resources: consumed CPU time, disk I/O, memory usage and number of tasks. With v235 another kind of resource can be controlled per-unit with systemd: network traffic (specifically IP).
Three new unit file settings have been added in this context:
IPAccounting= is a boolean setting. If enabled for a unit, all IP traffic sent and received by processes associated with it is counted both in terms of bytes and of packets.

IPAddressDeny= takes an IP address prefix (that means: an IP address with a network mask). All traffic from and to this address will be prohibited for processes of the service.

IPAddressAllow= is the matching positive counterpart to IPAddressDeny=. All traffic matching this IP address/network mask combination will be allowed, even if otherwise listed in IPAddressDeny=.
The three options are thin wrappers around kernel functionality
introduced with Linux 4.11: the control group eBPF hooks. The actual
work is done by the kernel, systemd just provides a number of new
settings to configure this facet of it. Note that cgroup/eBPF is
unrelated to classic Linux firewalling,
i.e. NetFilter/iptables. It's up to you whether you use one or the
other, or both in combination (or of course neither).
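As a sketch of how the three settings combine (a hypothetical unit, not taken from this post; the binary path is made up, and any/localhost are systemd's shorthand address prefixes), a service could be restricted to loopback traffic only:

```ini
# /etc/systemd/system/locked-down.service — hypothetical example unit.
# IPAddressDeny=any blocks all IP traffic for the service's processes;
# IPAddressAllow=localhost then punches a hole for loopback, since Allow
# entries take precedence over Deny entries.
[Service]
ExecStart=/usr/bin/my-local-daemon
IPAccounting=yes
IPAddressDeny=any
IPAddressAllow=localhost
```

Because IPAddressAllow= wins over IPAddressDeny=, this deny-everything/allow-some pattern gives a simple whitelist.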
Let's have a closer look at the IP accounting logic mentioned
above. Let's write a simple unit
/etc/systemd/system/ip-accounting-test.service:
[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes
This simple unit invokes the
ping(8) command to
send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which
is the Google DNS server IP; we use it for testing here, since it's
easy to remember, reachable everywhere and known to react to ICMP
pings; any other IP address responding to pings would be fine to use,
too). The IPAccounting= option is used to turn on IP accounting for
the unit.
Let's start this service after writing the file. Let's then have a
look at the status output of systemctl:
# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test
● ip-accounting-test.service
Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
Main PID: 32152 (ping)
IP: 168B in, 168B out
Tasks: 1 (limit: 4915)
CGroup: /system.slice/ip-accounting-test.service
└─32152 /usr/bin/ping 8.8.8.8
Okt 09 18:05:47 sigma systemd[1]: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping[32152]: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms
This shows the ping command running — it's currently at its second
ping cycle as we can see in the logs at the end of the output. More
interesting however is the IP: line further up showing the current
IP byte counters. It currently shows 168 bytes have been received, and
168 bytes have been sent. That the two counters are at the same value
is not surprising: ICMP ping requests and responses are supposed to
have the same size. Note that this line is shown only if
IPAccounting= is turned on for the service, as only then this data
is collected.
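The 168B figure checks out arithmetically. A quick sketch, using nothing beyond the 56(84) packet size ping itself printed above: each ping request or reply carries 56 payload bytes plus 8 bytes of ICMP header plus 20 bytes of IPv4 header, and two request/reply cycles have completed so far:

```shell
# 56 payload bytes + 8 (ICMP header) + 20 (IPv4 header) = 84 bytes on the
# wire per packet, matching the "56(84) bytes of data" line ping prints.
per_packet=$((56 + 8 + 20))
# Two ping cycles so far: two requests out, two replies in.
total=$((2 * per_packet))
echo "${per_packet} ${total}"   # 84 168
```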
Let's wait a bit, and invoke systemctl status again:
# systemctl status ip-accounting-test
● ip-accounting-test.service
Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
Main PID: 32152 (ping)
IP: 22.2K in, 22.2K out
Tasks: 1 (limit: 4915)
CGroup: /system.slice/ip-accounting-test.service
└─32152 /usr/bin/ping 8.8.8.8
Okt 09 18:10:07 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms
As we can see, after 269 pings the counters are much higher: at 22K.
Note that while systemctl status shows only the byte counters,
packet counters are kept as well. Use the low-level systemctl show
command to query the current raw values of the in and out packet and
byte counters:
# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449
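As a quick sanity check on those raw counters (a sketch using only the numbers shown above), integer-dividing the byte counter by the packet counter lands right at ping's 84-byte on-wire packet size:

```shell
# 37776 bytes over 449 packets: integer division gives the average on-wire
# size per packet, which matches ping's 56 + 28 = 84 bytes.
avg=$((37776 / 449))
echo "$avg"   # 84
```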
Of course, the same information is also available via the D-Bus
APIs. If you want to process this data further consider talking proper
D-Bus, rather than scraping the output of systemctl show.
Now, let's stop the service again:
# systemctl stop ip-accounting-test
When a service with such accounting turned on terminates, a log line
about all its consumed resources is written to the logs. Let's check
with journalctl:
# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd[1]: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd[1]: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.5K IP traffic, sent 49.5K IP traffic
The last line shown is the interesting one, that shows the accounting data. It's actually a structured log message, and among its metadata fields it contains the more comprehensive raw data:
# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
PRIORITY=6
_BOOT_ID=4c7e7adcba0c45b69d612857270716d3
_MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
_HOSTNAME=sigma
SYSLOG_FACILITY=3
SYSLOG_IDENTIFIER=systemd
_UID=0
_GID=0
_TRANSPORT=journal
_PID=1
_COMM=systemd
_EXE=/usr/lib/systemd/systemd
_CAP_EFFECTIVE=3fffffffff
_SYSTEMD_CGROUP=/init.scope
_SYSTEMD_UNIT=init.scope
_SYSTEMD_SLICE=-.slice
CODE_FILE=../src/core/unit.c
_CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
_SELINUX_CONTEXT=system_u:system_r:init_t:s0
UNIT=ip-accounting-test.service
CODE_LINE=2115
CODE_FUNC=unit_log_resources
MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
IP_METRIC_INGRESS_BYTES=50880
IP_METRIC_INGRESS_PACKETS=605
IP_METRIC_EGRESS_BYTES=50880
IP_METRIC_EGRESS_PACKETS=605
MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
_SOURCE_REALTIME_TIMESTAMP=1507565752649028
The interesting fields of this log message are of course
IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=,
IP_METRIC_EGRESS_BYTES=, IP_METRIC_EGRESS_PACKETS= that show the
consumed data.
The log message carries a message
ID
that may be used to quickly search for all such resource log messages
(ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for
messages of this ID with journalctl's -u switch to quickly find
out about the resource usage of any invocation of a specific
service. Let's try:
# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
Of course, the output above shows only one message at the moment, since we started the service only once, but a new one will appear every time you start and stop it again.
The IP accounting logic is also hooked up with
systemd-run,
which is useful for transiently running a command as systemd service
with IP accounting turned on. Let's try it:
# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K
This uses wget to download the
PDF version of the 2nd day
schedule
of everybody's favorite Linux user-space conference All Systems Go!
2017 (BTW, have you already booked your
ticket? We are very close to
selling out, be quick!). The IP traffic this command generated was
231K ingress and 4K egress. In the systemd-run command line two
parameters are important. First of all, we use -p IPAccounting=yes
to turn on IP accounting for the transient service (as above). And
secondly we use --wait to tell systemd-run to wait for the service
to exit. If --wait is used, systemd-run will also show you various
statistics about the service that just ran and terminated, including
the IP statistics you are seeing if IP accounting has been turned on.
It's fun to combine this sort of IP accounting with interactive transient units. Let's try that:
# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B
This uses systemd-run's --pty switch (or short: -t), which opens
an interactive pseudo-TTY connection to the invoked service process,
which is a Bourne shell in this case. Doing this means we have a full,
comprehensive shell with job control and everything. Since the shell
is running as part of a service with IP accounting turned on, all IP
traffic we generate or receive will be accounted for. And as soon as
we exit the shell, we'll see what it consumed. (For the sake of
brevity I actually didn't paste the whole output above, but truncated
core parts. Try it out for yourself, if you want to see the output in
full.)
Sometimes it might make sense to turn on IP accounting for a unit that
is already running. For that, use systemctl set-property
foobar.service IPAccounting=yes, which will instantly turn on
accounting for it. Note that it won't count retroactively though: only
the traffic sent/received after the point in time you turned it on
will be collected. You may turn off accounting for the unit with the
same command.
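A minimal session sketch of this, assuming a hypothetical unit name foobar.service; if I recall the property names correctly, the counters can then be read back via systemctl show:

```
# systemctl set-property foobar.service IPAccounting=yes
# systemctl show foobar.service -p IPIngressBytes -p IPEgressBytes
```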
Of course, sometimes it's interesting to collect IP accounting data
for all services, and turning on IPAccounting=yes in every single
unit is cumbersome. To deal with that there's a global option
DefaultIPAccounting=
available which can be set in /etc/systemd/system.conf.
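The relevant fragment would look roughly like this (manager-wide defaults live in the [Manager] section):

```ini
# /etc/systemd/system.conf (excerpt)
[Manager]
DefaultIPAccounting=yes
```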
So much about IP accounting. Let's now have a look at IP access
control with systemd 235. As mentioned above, the two new unit file
settings, IPAddressAllow= and IPAddressDeny= may be used for
that. They operate in the following way:
If the source address of an incoming packet or the destination
address of an outgoing packet matches one of the IP addresses/network
masks in the relevant unit's IPAddressAllow= setting then it will be
allowed to go through.
Otherwise, if a packet matches an IPAddressDeny= entry configured
for the service it is dropped.
If the packet matches neither of the above it is allowed to go through.
Or in other words, IPAddressDeny= implements a blacklist, but
IPAddressAllow= takes precedence.
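Expressed as a unit-file fragment, the three rules above would look roughly like this (the service name and addresses are just examples):

```ini
# Hypothetical drop-in for some service unit
[Service]
IPAddressDeny=any            # rule 2: drop everything matching this…
IPAddressAllow=127.0.0.0/8   # …except what IPAddressAllow= matches (rule 1)
IPAddressAllow=10.0.0.0/8    # multiple allow entries may be listed
```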
Let's try that out. Let's modify our last example above in order to get a transient service running an interactive shell which has such an access list set:
# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms
--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit
The access list we set up uses IPAddressDeny=any in order to define
an IP white-list: all traffic will be prohibited for the session,
except for what is explicitly white-listed. In this command line, we
white-listed two address prefixes: 8.8.8.8 (with no explicit network
mask, which means the mask with all bits turned on is implied,
i.e. /32), and 127.0.0.0/8. Thus, the service can communicate with
Google's DNS server and everything on the local loop-back, but nothing
else. The commands run in this interactive shell show this: First we
try pinging 8.8.8.8 which happily responds. Then, we try to ping
8.8.4.4 (that's Google's other DNS server, but excluded from this
white-list), and as we see it is immediately refused with an Operation
not permitted error. As a last step we ping 127.0.0.2 (which is on the
local loop-back), and we see it works fine again, as expected.
In the example above we used IPAddressDeny=any. The any
identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it's a
shortcut for everything, on both IPv4 and IPv6. A number of other
such shortcuts exist. For example, instead of spelling out
127.0.0.0/8 we could also have used the more descriptive shortcut
localhost which is expanded to 127.0.0.0/8 ::1/128, i.e. everything
on the local loopback device, on both IPv4 and IPv6.
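To the best of my knowledge, the systemd.resource-control man page documents two further shortcuts beyond any and localhost; their expansions are sketched below:

```ini
IPAddressDeny=any            # 0.0.0.0/0 ::/0
IPAddressAllow=localhost     # 127.0.0.0/8 ::1/128
IPAddressAllow=link-local    # 169.254.0.0/16 fe80::/64
IPAddressAllow=multicast     # 224.0.0.0/4 ff00::/8
```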
Being able to configure IP access lists individually for each unit is
pretty nice already. However, typically one wants to configure this
comprehensively, not just for individual units, but for a set of units
in one go or even the system as a whole. In systemd, that's possible
by making use of
.slice
units (for those who don't know systemd that well, slice units are a
concept for organizing services in hierarchical tree for the purpose of
resource management): the IP access list in effect for a unit is the
combination of the individual IP access lists configured for the unit
itself and those of all slice units it is contained in.
By default, system services are assigned to
system.slice,
which in turn is a child of the root slice
-.slice. Either
of these two slice units is hence suitable for locking down all
system services at once. If an access list is configured on
system.slice it will only apply to system services, however, if
configured on -.slice it will apply to all user processes of the
system, including all user session processes (i.e. which are by
default assigned to user.slice which is a child of -.slice) in
addition to the system services.
Let's make use of this:
# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8
The two commands above are a very powerful way to first turn off all IP communication for all system services (with the exception of loop-back traffic), followed by an explicit white-listing of 10.0.0.0/8 (which could refer to the local company network, you get the idea) but only for the Apache service.
After playing around a bit with this, let's talk about use-cases. Here are a few ideas:
The IP access list logic can in many ways provide a more modern replacement for the venerable TCP Wrapper, but unlike it, it applies to all IP sockets of a service unconditionally, and requires no explicit support in any way in the service's code: no patching required. On the other hand, TCP Wrappers have a number of features this scheme cannot cover; most importantly, systemd's IP access lists operate solely on the level of IP addresses and network masks, and there is no way to configure access by DNS name (though quite frankly, that is a very dubious feature anyway, as doing networking — unsecured networking even — in order to restrict networking sounds quite questionable, at least to me).
It can also replace (or augment) some facets of IP firewalling,
i.e. Linux NetFilter/iptables. Right now, systemd's access lists are
of course a lot more minimal than NetFilter, but they have one major
benefit: they understand the service concept, and thus are a lot more
context-aware than NetFilter. Classic firewalls, such as NetFilter,
derive most service context from the IP port number alone, but we live
in a world where IP port numbers are a lot more dynamic than they used
to be. As one example, a BitTorrent client or server may use any IP
port it likes for its file transfer, and writing IP firewalling rules
matching that precisely is hence hard. With the systemd IP access list
implementing this is easy: just set the list for your BitTorrent
service unit, and all is good.
Let me stress though that you should be careful when comparing NetFilter with systemd's IP address list logic, it's really like comparing apples and oranges: to start with, the IP address list logic has a clearly local focus, it only knows what a local service is and manages access of it. NetFilter on the other hand may run on border gateways, at a point where the traffic flowing through is pure IP, carrying no information about a systemd unit concept or anything like that.
It's a simple way to lock down distribution/vendor supplied system
services by default. For example, if you ship a service that you know
never needs to access the network, then simply set IPAddressDeny=any
(possibly combined with IPAddressAllow=localhost) for it, and it
will live in a very tight networking sand-box it cannot escape
from. systemd itself makes use of this for a number of its services by
default now. For example, the logging service
systemd-journald.service, the login manager systemd-logind or the
core-dump processing unit systemd-coredump@.service all have such a
rule set out-of-the-box, because we know that none of these
services should be able to access the network, under any
circumstances.
Because the IP access list logic can be combined with transient
units, it can be used to quickly and effectively sandbox arbitrary
commands, and even include them in shell pipelines and such. For
example, let's say we don't trust our
curl implementation (maybe it
got modified locally by a hacker, and phones home?), but want to use
it anyway to download the slides of my most recent casync
talk in order to
print them, but want to make sure it doesn't connect anywhere except
where we tell it to (and to make this even more fun, let's minimize
privileges further, by setting
DynamicUser=yes):
# systemd-resolve 0pointer.de
0pointer.de: 85.214.157.71
2a01:238:43ed:c300:10c3:bcf3:3266:da74
-- Information acquired via protocol DNS in 2.8ms.
-- Data is authenticated: no
# systemd-run --pipe -p IPAddressDeny=any \
-p IPAddressAllow=85.214.157.71 \
-p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
-p DynamicUser=yes \
curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
So much about use-cases. This is by no means a comprehensive list of what you can do with it, after all both IP accounting and IP access lists are very generic concepts. But I do hope the above inspires your fantasy.
IP accounting and IP access control are primarily concepts for the
local administrator. However, as suggested above, it's a very good
idea to ship services that by design have no network-facing
functionality with an access list of IPAddressDeny=any (and possibly
IPAddressAllow=localhost), in order to improve the out-of-the-box
security of our systems.
An option for security-minded distributions might be a more radical
approach: ship the system with -.slice or system.slice configured
to IPAddressDeny=any by default, and ask the administrator to punch
holes into that for each network facing service with systemctl
set-property … IPAddressAllow=…. But of course, that's only an
option for distributions willing to break compatibility with what was
before.
A couple of additional notes:
IP accounting and access lists may be mixed with socket activation. In this case, it's a good idea to configure access lists and accounting for both the socket unit that activates and the service unit that is activated, as both units maintain fully separate settings. Note that IP accounting and access lists configured on the socket unit applies to all sockets created on behalf of that unit, and even if these sockets are passed on to the activated services, they will still remain in effect and belong to the socket unit. This also means that IP traffic done on such sockets will be accounted to the socket unit, not the service unit. The fact that IP access lists are maintained separately for the kernel sockets created on behalf of the socket unit and for the kernel sockets created by the service code itself enables some interesting uses. For example, it's possible to set a relatively open access list on the socket unit, but a very restrictive access list on the service unit, thus making the sockets configured through the socket unit the only way in and out of the service.
systemd's IP accounting and access lists apply to IP sockets only,
not to sockets of any other address families. That also means that
AF_PACKET (i.e. raw) sockets are not covered. This means it's a good
idea to combine IP access lists with RestrictAddressFamilies=AF_UNIX
AF_INET
AF_INET6
in order to lock this down.
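A hypothetical hardening drop-in combining both, for a service known to need no network access at all:

```ini
# Hypothetical drop-in for a service without network-facing functionality
[Service]
IPAddressDeny=any
IPAddressAllow=localhost
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
```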
You may wonder if the per-unit resource log message and
systemd-run --wait may also show you details about other types of
resources consumed by a service. The answer is yes: if you turn on
CPUAccounting= for a service, you'll also see a summary of consumed
CPU time in the log message and the command output. And we are
planning to hook-up IOAccounting= the same way too, soon.
Note that IP accounting and access lists aren't entirely free. systemd inserts an eBPF program into the IP pipeline to make this functionality work. However, eBPF execution has been optimized for speed in the last kernel versions already, and given that it currently is in the focus of interest to many I'd expect to be optimized even further, so that the cost for enabling these features will be negligible, if it isn't already.
IP accounting is currently not recursive. That means you cannot use a slice unit to join the accounting of multiple units into one. This is something we definitely want to add, but requires some more kernel work first.
You might wonder how the
PrivateNetwork=
setting relates to IPAddressDeny=any. Superficially they have similar
effects: they make the network unavailable to services. However,
looking more closely there are a number of
differences. PrivateNetwork= is implemented using Linux network
name-spaces. As such it entirely detaches all networking of a service
from the host, including non-IP networking. It does so by creating a
private little environment the service lives in where communication
with itself is still allowed though. In addition using the
JoinsNamespaceOf=
dependency additional services may be added to the same environment,
thus permitting communication with each other but not with anything
outside of this group. IPAddressAllow= and IPAddressDeny= are much
less invasive. First of all they apply to IP networking only, and can
match against specific IP addresses. A service running with
PrivateNetwork= turned off but IPAddressDeny=any turned on may
enumerate the network interfaces and the IP addresses configured on
them, even though it cannot actually do any IP communication. On the
other hand, if you
turn on PrivateNetwork= all network interfaces besides lo
disappear. Long story short: depending on your use-case, one, the other,
both or neither might be suitable for sand-boxing of your service. If
possible I'd always turn on both, for best security, and that's what
we do for all of systemd's own long-running services.
And that's all for now. Have fun with per-unit IP accounting and access lists!
| October 05, 2017 | |
TL;DR: you may now configure systemd to dynamically allocate a UNIX user ID for service processes when it starts them and release it when it stops them. It's pretty secure, mixes well with transient services, socket activated services and service templating.
Today we released systemd 235. Among other improvements this greatly extends the dynamic user logic of systemd. Dynamic users are a powerful but little known concept, supported in its basic form since systemd 232. With this blog story I hope to make it a bit better known.
The UNIX user concept is the most basic and well-understood security concept in POSIX operating systems. It is UNIX/POSIX' primary security concept, the one everybody can agree on, and most security concepts that came after it (such as process capabilities, SELinux and other MACs, user name-spaces, …) in some form or another build on it, extend it or at least interface with it. If you build a Linux kernel with all security features turned off, the user concept is pretty much the one you'll still retain.
Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today's UNIX systems
don't really use the user concept like that anymore though. Most of
today's systems probably have only one actual human user (or even
less!), but their user databases (/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Even though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.
The people behind the Android OS realized the relevance of the UNIX user concept as the primary security concept on UNIX, and took its use even further: on Android not only system services take benefit of the UNIX user concept, but each UI app gets its own, individual user identity too — thus neatly separating app resources from each other, and protecting app processes from each other, too.
Back in the more traditional Linux world things are a bit less advanced in this area. Even though users are the quintessential UNIX security concept, allocation and management of system users is still a pretty limited, raw and static affair. In most cases, RPM or DEB package installation scripts allocate a fixed number of (usually one) system users when you install the package of a service that wants to take benefit of the user concept, and from that point on the system user remains allocated on the system and is never deallocated again, even if the package is later removed again. Most Linux distributions limit the number of system users to 1000 (which isn't particularly a lot). Allocating a system user is hence expensive: the number of available users is limited, and there's no defined way to dispose of them after use. If you make use of system users too liberally, you are very likely to run out of them sooner rather than later.
You may wonder why system users are generally not deallocated when the package that registered them is uninstalled from a system (at least on most distributions). The reason for that is one relevant property of the user concept (you might even want to call this a design flaw): user IDs are sticky to files (and other objects such as IPC objects). If a service running as a specific system user creates a file at some location, and is then terminated and its package and user removed, then the created file still belongs to the numeric ID ("UID") the system user originally got assigned. When the next system user is allocated and — due to ID recycling — happens to get assigned the same numeric ID, then it will also gain access to the file, and that's generally considered a problem, given that the file belonged to a potentially very different service once upon a time, and likely should not be readable or changeable by anything coming after it. Distributions hence tend to avoid UID recycling which means system users remain registered forever on a system after they have been allocated once.
The above is a description of the status quo ante. Let's now focus on what systemd's dynamic user concept brings to the table, to improve the situation.
With systemd dynamic users we hope to make it easier and cheaper to allocate system users on-the-fly, thus substantially increasing the possible uses of this core UNIX security concept.
If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the
DynamicUser=
option in its [Service] section to yes. If you do, a system user is
dynamically allocated the instant the service binary is invoked, and
released again when the service terminates. The user is automatically
allocated from the UID range 61184–65519, by looking for a so far
unused UID.
Now you may wonder, how does this concept deal with the sticky user issue discussed above? In order to counter the problem, two strategies easily come to mind:
Prohibit the service from creating any files/directories or IPC objects
Automatically removing the files/directories or IPC objects the service created when it shuts down.
In systemd we implemented both strategies, but for different parts of the execution environment. Specifically:
Setting DynamicUser=yes implies
ProtectSystem=strict
and
ProtectHome=read-only. These
sand-boxing options turn off write access to pretty much the whole OS
directory tree, with a few relevant exceptions, such as the API file
systems /proc, /sys and so on, as well as /tmp and
/var/tmp. (BTW: setting these two options on your regular services
that do not use DynamicUser= is a good idea too, as it drastically
reduces the exposure of the system to exploited services.)
Setting DynamicUser=yes implies
PrivateTmp=yes. This
option sets up /tmp and /var/tmp for the service in a way that it
gets its own, disconnected version of these directories, that are not
shared by other services, and whose life-cycle is bound to the
service's own life-cycle. Thus if the service goes down, the user is
removed and all its temporary files and directories with it. (BTW: as
above, consider setting this option for your regular services that do
not use DynamicUser= too, it's a great way to lock things down
security-wise.)
Setting DynamicUser=yes implies
RemoveIPC=yes. This
option ensures that when the service goes down all SysV and POSIX IPC
objects (shared memory, message queues, semaphores) owned by the
service's user are removed. Thus, the life-cycle of the IPC objects is
bound to the life-cycle of the dynamic user and service, too. (BTW:
yes, here too, consider using this in your regular services, too!)
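Put together, a unit using the dynamic user logic thus behaves roughly as if the following had been configured explicitly:

```ini
[Service]
DynamicUser=yes
# Per the text above, this single line implies approximately:
#   ProtectSystem=strict
#   ProtectHome=read-only
#   PrivateTmp=yes
#   RemoveIPC=yes
```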
With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/tmp and /var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with
effectively.
The
RuntimeDirectory=
option may be used to open up a bit the sandbox to external
programs. If you set it to a directory name of your choice, it will be
created below /run when the service is started, and removed in its
entirety when it is terminated. The ownership of the directory is
assigned to the service's dynamic user. This way, a dynamic user
service can expose API interfaces (AF_UNIX sockets, …) to other
services at a well-defined place and again bind the life-cycle of it to
the service's own run-time. Example: set RuntimeDirectory=foobar in
your service, and watch how a directory /run/foobar appears at the
moment you start the service, and disappears the moment you stop
it again. (BTW: Much like the other settings discussed above,
RuntimeDirectory= may be used outside of the DynamicUser= context
too, and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)
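A minimal sketch of such a unit; note that /usr/bin/mydaemon and its --socket flag are made-up placeholders for illustration only:

```ini
# Hypothetical service exposing an AF_UNIX API socket below /run/foobar
[Service]
DynamicUser=yes
RuntimeDirectory=foobar
ExecStart=/usr/bin/mydaemon --socket /run/foobar/api.sock
```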
Of course, a service running in such an environment (although already very useful for many cases!), has a major limitation: it cannot leave persistent data around it can reuse on a later run. As pretty much the whole OS directory tree is read-only to it, there's simply no place it could put the data that survives from one service invocation to the next.
With systemd 235 this limitation is removed: there are now three new
settings:
StateDirectory=,
LogsDirectory= and CacheDirectory=. In many ways they operate like
RuntimeDirectory=, but create sub-directories below /var/lib,
/var/log and /var/cache, respectively. There's one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.
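Sketched as a unit-file fragment (the name foobar is a placeholder), the three settings create /var/lib/foobar, /var/log/foobar and /var/cache/foobar, all owned by the service's dynamic user and persisting across invocations:

```ini
[Service]
DynamicUser=yes
StateDirectory=foobar
LogsDirectory=foobar
CacheDirectory=foobar
```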
Of course, the obvious question to ask now is: how do these three settings deal with the sticky file ownership problem?
For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it's even worse
than just getting access, due to the existence of setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(0700 and root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.
How is that applied to dynamic user services? Let's say
StateDirectory=foobar is set for a service that has DynamicUser=
turned off. The instant the service is started, /var/lib/foobar is
created as state directory, owned by the service's user and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory /var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service's
dynamic user). The /var/lib/private directory is created as boundary
directory: it's owned by root:root, and has a restrictive access
mode of 0700. Both the symlink and the service's state directory will
survive the service's life-cycle, but the state directory will remain,
and continues to be owned by the now disposed dynamic UID — however it
is protected from other host users (and other services which might get
the same dynamic UID assigned due to UID recycling) by the boundary
directory.
The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo /tmp and
/var/tmp as mentioned above), except for /var/lib/private, which
is over-mounted with a read-only tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this tmpfs file system instance another mount is
placed: a bind mount to the host's real /var/lib/private/foobar
directory, onto the same name. Putting this together means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the /var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink /var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if DynamicUser= was not used. Long
story short: for the daemon and from the view from the host the
indirection through /var/lib/private is mostly transparent.
This logic of course raises another question: what happens to the state directory if a dynamic user service is started with a state directory configured, gets UID X assigned on this first invocation, then terminates and is restarted and now gets UID Y assigned on the second invocation, with X ≠ Y? On the second invocation the directory — and all the files and directories below it — will still be owned by the original UID X so how could the second instance running as Y access it? Our way out is simple: systemd will recursively change the ownership of the directory and everything contained within it to UID Y before invoking the service's executable.
Of course, such recursive ownership changing (chown()ing) of whole
directory trees can become expensive (though according to my
experience, IRL and for most services it's much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service's
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it's preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).
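The hash-then-probe strategy can be illustrated with a toy model. This is emphatically not systemd's actual code: the choice of hash (SHA-256) and the linear probing are assumptions made purely for illustration; only the UID range and the "stable hash as starting point" idea come from the text.

```python
# Illustrative sketch only: a toy model of the allocation strategy the
# text describes, using a made-up hash (SHA-256) to map a service name
# into systemd's dynamic UID range.
import hashlib

UID_MIN, UID_MAX = 61184, 65519   # systemd's dynamic UID range
SPAN = UID_MAX - UID_MIN + 1      # 4336 possible UIDs


def candidate_uid(service_name: str) -> int:
    """Stable starting point: hash the service name into the range."""
    digest = hashlib.sha256(service_name.encode()).digest()
    return UID_MIN + int.from_bytes(digest[:4], "big") % SPAN


def allocate_uid(service_name: str, used: set) -> int:
    """Allocation loop: walk forward from the hashed UID to a free slot."""
    start = candidate_uid(service_name)
    for offset in range(SPAN):
        uid = UID_MIN + (start - UID_MIN + offset) % SPAN
        if uid not in used:
            return uid
    raise RuntimeError("dynamic UID range exhausted")


# The same name always hashes to the same start, so a restarted service
# usually gets its old UID back and the recursive chown() can be skipped.
assert allocate_uid("foobar.service", set()) == candidate_uid("foobar.service")
```

Because the starting UID is a pure function of the name, collisions only force a chown() when another service happens to occupy the hashed slot, which with ~4K slots becomes likely only under heavy use of the feature, exactly as the text notes.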
Note that CacheDirectory= and LogsDirectory= work very similarly to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and /var/log directories, and their boundary
directory hence is /var/cache/private and /var/log/private,
respectively.
So, after all this introduction, let's have a look how this all can be put together. Here's a trivial example:
# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
Main PID: 2967 (sleep)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/dynamic-user-test.service
└─2967 /usr/bin/sleep 4711
Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
2967 sleep dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user
In this example, we create a unit file with DynamicUser= turned on,
start it, check if it's running correctly, have a look at the service
process' user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn't configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.
That's already pretty cool. Let's step it up a notch, by doing the
same in an interactive transient service (for those who don't know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the
systemd-run
command from the shell. Think: run a service without having to write a
unit file first):
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root root 60 6. Okt 13:21 .
drwxr-xr-x. 1 root root 852 6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750 8 6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12 6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0 6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root root 66 6. Okt 13:21 .
drwxr-xr-x. 1 root root 852 6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122 8 6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12 6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8 6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello
The above invokes an interactive shell as transient service
run-u15750.service (systemd-run picked that name automatically,
since we didn't specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as /var/lib/wuff. In the interactive shell
running inside the service, the ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That's because the user
ceased to exist at this point, and "ls" shows the raw UID for files
owned by users that don't exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, while from inside
we saw it being world-readable.
Now, let's see how things look if we start another transient service, reusing the state directory from the first invocation:
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087 8 6. Okt 13:22 .
drwxr-xr-x. 3 root root 60 6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087 6 6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit
Here, systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
recursive chown()ing was required.
And that's the end of our example, which hopefully illustrated a bit how this concept and implementation works.
Now that we had a look at how to enable this logic for a unit and how it is implemented, let's discuss where this actually could be useful in real life.
One major benefit of dynamic user IDs is that running a privilege-separated service leaves no artifacts in the system. A system user is allocated and made use of, but it is discarded automatically in a safe and secure way after use, in a fashion that is safe for later recycling. Thus, quickly invoking a short-lived service for processing some job can be protected properly through a user ID without having to pre-allocate it and without this draining the available UID pool any longer than necessary.
In many cases, starting a service no longer requires
package-specific preparation. Or in other words, quite often
useradd/mkdir/chown/chmod invocations in "post-inst" package
scripts, as well as
sysusers.d
and
tmpfiles.d
drop-ins become unnecessary, as the DynamicUser= and
StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the
necessary work automatically, on-demand and with a well-defined
life-cycle.
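As a concrete sketch of what this replaces (service and directory names here are invented for illustration, not taken from any real package), a daemon that previously needed a sysusers.d entry plus tmpfiles.d drop-ins to set up its user and directories could instead declare everything in the unit file:

```ini
[Service]
ExecStart=/usr/bin/example-daemon
DynamicUser=yes
StateDirectory=example
CacheDirectory=example
LogsDirectory=example
```

With this, /var/lib/example, /var/cache/example and /var/log/example are created with the right ownership when the service starts, and no post-inst scripting is required.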
By combining dynamic user IDs with the transient unit concept, new
creative ways of sand-boxing are made available. For example, let's say
you don't trust the correct implementation of the sort command. You
can now lock it into a simple, robust, dynamic UID sandbox with a
simple systemd-run and still integrate it into a shell pipeline like
any other command. Here's an example, showcasing a shell pipeline
whose middle element runs under a dynamically allocated UID that is
released when the pipeline ends:
# cat some-file.txt | systemd-run --pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
By combining dynamic user IDs with the systemd templating logic it
is now possible to do much more fine-grained and fully automatic UID
management. For example, let's say you have a template unit file
/etc/systemd/system/foobard@.service:
[Service]
ExecStart=/usr/bin/myfoobarserviced
DynamicUser=1
StateDirectory=foobar/%i
Now, let's say you want to start one instance of this service for each of your customers. All you need to do now for that is:
# systemctl enable foobard@customerxyz.service --now
And you are done. (Invoke this as many times as you like, each time
replacing customerxyz by some customer identifier, you get the
idea.)
By combining dynamic user IDs with socket activation you may easily
implement a system where each incoming connection is served by a
process instance running as a different, fresh, newly allocated UID
within its own sandbox. Here's an example waldo.socket:
[Socket]
ListenStream=2048
Accept=yes
With a matching waldo@.service:
[Service]
ExecStart=-/usr/bin/myservicebinary
DynamicUser=yes
With the two unit files above, systemd will listen on TCP/IP port
2048, and for each incoming connection invoke a fresh instance of
waldo@.service, each time utilizing a different, new,
dynamically allocated UID, neatly isolated from any other
instance.
Dynamic user IDs combine very well with state-less systems,
i.e. systems that come up with an unpopulated /etc and /var. A
service using dynamic user IDs and the StateDirectory=,
CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts
will implicitly allocate the users and directories it needs for
running, right at the moment where it needs it.
Dynamic users are a very generic concept, hence a multitude of other uses are thinkable; the list above is just supposed to trigger your imagination.
I am pretty sure that a large number of services shipped with today's
distributions could benefit from using DynamicUser= and
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any sysusers.d
and tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
DynamicUser= and StateDirectory= (and friends) cannot or should
not be used. To name a few:
Services that need to write to files outside of /run/<package>,
/var/lib/<package>, /var/cache/<package>, /var/log/<package>,
/var/tmp, /tmp, /dev/shm are generally incompatible with this
scheme. This rules out daemons that upgrade the system as one example,
as that involves writing to /usr.
Services that maintain a herd of processes with different user IDs. Some SMTP services are like this. If your service has such a super-server design, UID management needs to be done by the super-server itself, which rules out systemd doing its dynamic UID magic for it.
Services which run as root (obviously…) or are otherwise privileged.
Services that need to live in the same mount name-space as the host
system (for example, because they want to establish mount points
visible system-wide). As mentioned DynamicUser= implies
ProtectSystem=, PrivateTmp= and related options, which all require
the service to run in its own mount name-space.
Your focus is older distributions, i.e. distributions that do not
have systemd 232 (for DynamicUser=) or systemd 235 (for
StateDirectory= and friends) yet.
If your distribution's packaging guides don't allow it. Consult your packaging guides, and possibly start a discussion on your distribution's mailing list about this.
A couple of additional, random notes about the implementation and use of these features:
Do note that allocating or deallocating a dynamic user leaves
/etc/passwd untouched. A dynamic user is added into the user
database through the glibc NSS module
nss-systemd,
and this information never hits the disk.
On traditional UNIX systems it was the job of the daemon process
itself to drop privileges, while the DynamicUser= concept is
designed around the service manager (i.e. systemd) being responsible
for that. That said, since v235 there's a way to marry DynamicUser=
and such services which want to drop privileges on their own. For
that, turn on DynamicUser= and set
User=
to the user name the service wants to setuid() to. This has the
effect that systemd will allocate the dynamic user under the specified
name when the service is started. Then, prefix the command line you
specify in
ExecStart=
with a single ! character. If you do, the user is allocated for the
service, but the daemon binary is invoked as root instead of the
allocated user, under the assumption that the daemon changes its UID
on its own the right way. Note that after registration the user will
show up instantly in the user database, and is hence resolvable like
any other by the daemon process. Example:
ExecStart=!/usr/bin/mydaemond
You may wonder why systemd uses the UID range 61184–65519 for its
dynamic user allocations (side note: in hexadecimal this reads as
0xEF00–0xFFEF). That's because distributions (specifically Fedora)
tend to allocate regular users from below the 60000 range, and we
don't want to step into that. We also want to stay away from 65535 and
a bit around it, as some of these UIDs have special meanings (65535 is
often used as special value for "invalid" or "no" UID, as it is
identical to the 16bit value -1; 65534 is generally mapped to the
"nobody" user, and is where some kernel subsystems map unmappable
UIDs). Finally, we want to stay within the 16bit range. In a user
name-spacing world each container tends to have much less than the full
32bit UID range available that Linux kernels theoretically
provide. Everybody apparently can agree that a container should at
least cover the 16bit range though — already to include a nobody
user. (And quite frankly, I am pretty sure assigning 64K UIDs per
container is nicely systematic, as the higher 16 bits of the 32-bit
UID values this way become a container ID, while the lower 16 bits
become the logical UID within each container, if you still follow what
I am babbling here…). And before you ask: no this range cannot be
changed right now, it's compiled in. We might change that eventually
however.
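The arithmetic here is easy to verify; the following short sketch (just an illustration, not systemd code) confirms the hexadecimal boundaries and shows how a 32-bit UID would split into a container ID and a per-container logical UID:

```python
# Dynamic UID range used by systemd, expressed both ways.
DYNAMIC_UID_MIN = 0xEF00  # 61184
DYNAMIC_UID_MAX = 0xFFEF  # 65519

def split_uid(uid32):
    """Split a 32-bit UID into (container ID, logical UID within container)."""
    return uid32 >> 16, uid32 & 0xFFFF

# Container 5, logical UID 1000 inside that container:
print(split_uid((5 << 16) | 1000))  # → (5, 1000)
```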
You might wonder what happens if you already used UIDs from the 61184–65519 range on your system for other purposes. systemd should handle that mostly fine, as long as that usage is properly registered in the user database: when allocating a dynamic user we pick a UID, see if it is currently used somehow, and if yes pick a different one, until we find a free one. Whether a UID is used right now or not is checked through NSS calls. Moreover the IPC object lists are checked to see if there are any objects owned by the UID we are about to pick. This means systemd will avoid using UIDs you have assigned otherwise. Note however that this of course makes the pool of available UIDs smaller, and in the worst cases this means that allocating a dynamic user might fail because there simply are no unused UIDs in the range.
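The allocation strategy just described can be sketched roughly like this (this is an illustrative simplification, not systemd's actual code; the `uid_in_use` callable stands in for the NSS and IPC-object checks):

```python
import random

def allocate_dynamic_uid(uid_in_use, uid_min=61184, uid_max=65519, max_tries=100):
    """Pick candidate UIDs from the dynamic range, skipping any that the
    user database (or IPC object lists) reports as taken. Raises if the
    pool appears exhausted, mirroring the failure mode described above."""
    for _ in range(max_tries):
        candidate = random.randint(uid_min, uid_max)
        if not uid_in_use(candidate):
            return candidate
    raise RuntimeError("no unused UID available in the dynamic range")
```

For example, `allocate_dynamic_uid(lambda u: u in {61184, 61185})` returns some UID in the range other than those two.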
If not specified otherwise the name for a dynamically allocated
user is derived from the service name. Not everything that's valid in
a service name is valid in a user-name however, and in some cases a
randomized name is used instead to deal with this. Often it makes
sense to pick the user names to register explicitly. For that use
User= and choose whatever you like.
If you pick a user name with User= and combine it with
DynamicUser= and the user already exists statically it will be used
for the service and the dynamic user logic is automatically
disabled. This permits automatic up- and downgrades between static and
dynamic UIDs. For example, it provides a nice way to move a system
from static to dynamic UIDs in a compatible way: as long as you select
the same User= value before and after switching DynamicUser= on,
the service will continue to use the statically allocated user if it
exists, and only operates in the dynamic mode if it does not. This is
useful for other cases as well, for example to adapt a service that
normally would use a dynamic user to concepts that require statically
assigned UIDs, for example to marry classic UID-based file system
quota with such services.
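A hedged illustration of that migration path (the daemon and user names are invented): a unit that sets both options keeps a pre-existing static user, and otherwise allocates a dynamic one under the same name.

```ini
[Service]
ExecStart=/usr/bin/mydaemond
User=mydaemon
DynamicUser=yes
```

If the administrator (or a package) created a static mydaemon user beforehand, the service keeps using it; on systems without it, systemd allocates a dynamic UID under that name instead.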
systemd always allocates a pair of dynamic UID and GID at the same time, with the same numeric ID.
If the Linux kernel had a "shiftfs" or similar functionality,
i.e. a way to mount an existing directory to a second place, but map
the exposed UIDs/GIDs in some way configurable at mount time, this
would be excellent for the implementation of StateDirectory= in
conjunction with DynamicUser=. It would make the recursive
chown()ing step unnecessary, as the host version of the state
directory could simply be mounted into the service's mount
name-space, with a shift applied that maps the directory's owner to the
service's UID/GID. But I don't have high hopes in this regard, as all
work being done in this area appears to be bound to user name-spacing
— which is a concept not used here (and I guess one could say user
name-spacing is probably more a source of problems than a solution to
one, but you are welcome to disagree on that).
And that's all for now. Enjoy your dynamic users!
| October 04, 2017 | |
| October 02, 2017 | |
In the previous post in this series I introduced how to render the shadow map image, which is simply the depth information for the scene from the view point of the light. In this post I will cover how to use the shadow map to render shadows.
The general idea is that for each fragment we produce we compute the light space position of the fragment. In this space, the Z component tells us the depth of the fragment from the perspective of the light source. The next step requires to compare this value with the shadow map value for that same X,Y position. If the fragment’s light space Z is larger than the value we read from the shadow map, then it means that this fragment is behind an object that is closer to the light and therefore we can say that it is in the shadows, otherwise we know it receives direct light.
Let’s have a look at the vertex shader changes required for this:
void main()
{
vec4 pos = vec4(in_position.x, in_position.y, in_position.z, 1.0);
out_world_pos = Model * pos;
gl_Position = Projection * View * out_world_pos;
[...]
out_light_space_pos = LightViewProjection * out_world_pos;
}
The vertex shader code above only shows the code relevant to the shadow mapping technique. Model is the model matrix with the spatial transforms for the vertex we are rendering, View and Projection represent the camera’s view and projection matrices and the LightViewProjection represents the product of the light’s view and projection matrices. The variables prefixed with ‘out’ represent vertex shader outputs to the fragment shader.
The code generates the world space position of the vertex (world_pos) and clip space position (gl_Position) as usual, but then also computes the light space position for the vertex (out_light_space_pos) by applying the View and Projection transforms of the light to the world position of the vertex, which gives us the position of the vertex in light space. This will be used in the fragment shader to sample the shadow map.
The fragment shader will need to:
Transform the X,Y coordinates from NDC space [-1, 1] to texture space [0, 1].
Sample the shadow map and compare the result with the light space Z position we computed for this fragment to decide if the fragment is shadowed.
The implementation would look something like this:
float
compute_shadow_factor(vec4 light_space_pos, sampler2D shadow_map)
{
// Convert light space position to NDC
vec3 light_space_ndc = light_space_pos.xyz / light_space_pos.w;
// If the fragment is outside the light's projection then it is outside
// the light's influence, which means it is in the shadow (notice that
// such sample would be outside the shadow map image)
if (abs(light_space_ndc.x) > 1.0 ||
abs(light_space_ndc.y) > 1.0 ||
abs(light_space_ndc.z) > 1.0)
return 0.0;
// Translate from NDC to shadow map space (Vulkan's Z is already in [0..1])
vec2 shadow_map_coord = light_space_ndc.xy * 0.5 + 0.5;
// Check if the sample is in the light or in the shadow
if (light_space_ndc.z > texture(shadow_map, shadow_map_coord.xy).x)
return 0.0; // In the shadow
// In the light
return 1.0;
}
The function returns 0.0 if the fragment is in the shadows and 1.0 otherwise. Note that the function also avoids sampling the shadow map for fragments that are outside the light's frustum (and therefore are not recorded in the shadow map texture): we know that any fragment in this situation is shadowed because it is obviously not visible from the light. This assumption is valid for spotlights and point lights, because in these cases the shadow map captures the entire influence area of the light source. For directional lights that affect the entire scene, however, we usually need to limit the light's frustum to the surroundings of the camera, and in that case we probably want to consider fragments outside the frustum as lit instead.
Now all that remains in the shader code is to use this factor to eliminate the diffuse and specular components for fragments that are in the shadows. To achieve this we can simply multiply these components by the factor computed by this function.
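For example, the lighting computation in the fragment shader could fold the factor in like this (the variable names are illustrative, not taken from the actual code in the series):

```glsl
float shadow = compute_shadow_factor(in_light_space_pos, shadow_map);
vec3 color = ambient + shadow * (diffuse + specular);
```

With shadow at 0.0 the fragment receives only ambient light; at 1.0 it is fully lit.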
The list of changes in the main program is straightforward: we only need to update the pipeline layout and descriptors to attach the new resources required by the shaders; specifically, the light's view projection matrix in the vertex shader (which could be bound as a push constant or a uniform buffer, for example) and the shadow map sampler in the fragment shader.
Binding the light’s ViewProjection matrix is no different from binding the other matrices we need in the shaders so I won’t cover it here. The shadow map sampler doesn’t really have any mysteries either, but since that is new let’s have a look at the code:
...
VkSampler sampler;
VkSamplerCreateInfo sampler_info = {};
sampler_info.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
sampler_info.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.anisotropyEnable = false;
sampler_info.maxAnisotropy = 1.0f;
sampler_info.borderColor = VK_BORDER_COLOR_INT_OPAQUE_BLACK;
sampler_info.unnormalizedCoordinates = false;
sampler_info.compareEnable = false;
sampler_info.compareOp = VK_COMPARE_OP_ALWAYS;
sampler_info.magFilter = VK_FILTER_LINEAR;
sampler_info.minFilter = VK_FILTER_LINEAR;
sampler_info.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
sampler_info.mipLodBias = 0.0f;
sampler_info.minLod = 0.0f;
sampler_info.maxLod = 100.0f;
VkResult result =
vkCreateSampler(device, &sampler_info, NULL, &sampler);
...
This creates the sampler object that we will use to sample the shadow map image. The address mode fields are not very relevant, since our shader ensures that we never attempt to sample outside the shadow map. We use linear filtering, although that is not mandatory of course, and we select nearest for the mipmap filter because the shadow map has only one mip level.
Next we have to bind this sampler to the actual shadow map image. As usual in Vulkan, we do this with a descriptor update. For that we need to create a descriptor of type VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, and then do the update like this:
VkDescriptorImageInfo image_info;
image_info.sampler = sampler;
image_info.imageView = shadow_map_view;
image_info.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
VkWriteDescriptorSet writes;
writes.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
writes.pNext = NULL;
writes.dstSet = image_descriptor_set;
writes.dstBinding = 0;
writes.dstArrayElement = 0;
writes.descriptorCount = 1;
writes.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
writes.pBufferInfo = NULL;
writes.pImageInfo = &image_info;
writes.pTexelBufferView = NULL;
vkUpdateDescriptorSets(ctx->device, 1, &writes, 0, NULL);
A combined image sampler brings together the texture image to sample from (a VkImageView of the image actually) and the description of the filtering we want to use to sample that image (a VkSampler). As with all descriptor sets, we need to indicate its binding point in the set (in our case it is 0 because we have a separate descriptor set layout for this that only contains one binding for the combined image sampler).
Notice that we need to specify the layout of the image when it will be sampled from the shaders, which needs to be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
If you revisit the definition of our render pass for the shadow map image, you'll see that we had it automatically transition the shadow map to this layout at the end of the render pass. The shadow map image will therefore be in this layout immediately after it has been rendered, so we don't need to add barriers to execute the layout transition manually.
So that’s it, with this we have all the pieces and our scene should be rendering shadows now. Unfortunately, we are not quite done yet, if you look at the results, you will notice a lot of dark noise in surfaces that are directly lit. This is an artifact of shadow mapping called self-shadowing or shadow acne. The next section explains how to get rid of it.
Self-shadowing can happen for fragments on surfaces that are directly lit by a light source for which we are producing a shadow map. The reason is that for these fragments the Z coordinate in light space should exactly match the value we read from the shadow map for the same X,Y coordinates. In other words, for these fragments we expect:
light_space_ndc.z == texture(shadow_map, shadow_map_coord.xy).x
However, due to different precision errors that can be produced on both sides of that equation, we may end up with slightly different values for each side, and when the value we produce for light_space_ndc.z ends up being larger than what we read from the shadow map, even by a very small amount, the pixel will be marked as shadowed, leading to the result we see in that image.
The usual way to fix this problem involves adding a small depth offset or bias to the depth values we store in the shadow map so we ensure that we always read a larger value from the shadow map for the fragment. Another way to think about this is to think that when we record the shadow map, we push every object in the scene slightly away from the light source. Unfortunately, this depth offset bias should not be a constant value, since the angle between the surface normals and the vectors from the light source to the fragments also affects the bias value that we should use to correct the self-shadowing.
Thankfully, GPU hardware provides means to account for this. In Vulkan, when we define the rasterization state of the pipeline we use to create the shadow map, we can add the following:
VkPipelineRasterizationStateCreateInfo rs;
...
rs.depthBiasEnable = VK_TRUE;
rs.depthBiasConstantFactor = 4.0f;
rs.depthBiasSlopeFactor = 1.5f;
Where depthBiasConstantFactor is a constant factor that is automatically added to all depth values produced and depthBiasSlopeFactor is a factor that is used to compute depth offsets also based on the angle. This provides us with the means we need without having to do any extra work in the shaders ourselves to offset the depth values correctly. In OpenGL the same functionality is available via glPolygonOffset().
Notice that the bias values that produce the best results can change for each scene. Also notice that values that are too large can lead to shadows that are "detached" from the objects that cast them, producing very unrealistic results. This effect is known as Peter Panning, and can be observed in this image:
As we can see in the image, we no longer have self-shadowing, but now we have the opposite problem: the shadows cast by the red and blue blocks are visibly incorrect, as if they were rendered further away from the light source than they should be.
If the bias values are chosen carefully, then we should be able to get a good result, although some times we might need to accept some level of visible self-shadowing or visible Peter Panning:
The image above shows correct shadowing without any self-shadowing or visible Peter Panning. You may wonder why we can't see some of the shadows from the red light on the floor where the green light is more intense. The reason is that the green light (although this is not obvious, since I don't actually render the objects emitting the lights) is mostly pointing down, so its reflection on the floor (which has normals pointing upwards) is strong enough that the contribution from the red light to the floor pixels in this area is insignificant in comparison, making the shadows cast by the red light barely visible. You can still see some shadowing if you get close enough with the camera though, I promise.
The images above show aliasing around at the edges of the shadows. This happens because for each fragment we decide if it is shadowed or not as a boolean decision, and we use that result to fully shadow or fully light the pixel, leading to aliasing:
Another thing contributing to the aliasing effect is that a single pixel in the shadow map image can possibly expand to multiple pixels in camera space. That can happen if the camera is looking at an area of the scene that is close to the camera, but far away from the light source for example. In that case, the resolution of that area of the scene in the shadow map is small, but it is large for the camera, meaning that we end up sampling the same pixel from the shadow map to shadow larger areas in the scene as seen by the camera.
Increasing the resolution of the shadow map image will help with this, but it is not a very scalable solution and can quickly become prohibitive. Alternatively, we can implement something called Percentage-Closer Filtering (PCF) to produce antialiased shadows. The technique is simple: instead of sampling just one texel from the shadow map, we take multiple samples in its neighborhood and average the results to produce shadow factors that do not need to be exactly 1 or 0, but can be somewhere in between, producing smoother transitions for shadowed pixels on the shadow edges. The more samples we take, the smoother the shadow edges get, but do note that extra samples per pixel also come with a performance cost.
This is how we can update our compute_shadow_factor() function to add PCF:
float
compute_shadow_factor(vec4 light_space_pos,
sampler2D shadow_map,
uint shadow_map_size,
uint pcf_size)
{
vec3 light_space_ndc = light_space_pos.xyz / light_space_pos.w;
if (abs(light_space_ndc.x) > 1.0 ||
abs(light_space_ndc.y) > 1.0 ||
abs(light_space_ndc.z) > 1.0)
return 0.0;
vec2 shadow_map_coord = light_space_ndc.xy * 0.5 + 0.5;
// compute total number of samples to take from the shadow map
int pcf_size_minus_1 = int(pcf_size - 1);
float kernel_size = 2.0 * pcf_size_minus_1 + 1.0;
float num_samples = kernel_size * kernel_size;
// Counter for the shadow map samples not in the shadow
float lighted_count = 0.0;
// Take samples from the shadow map
float shadow_map_texel_size = 1.0 / shadow_map_size;
for (int x = -pcf_size_minus_1; x <= pcf_size_minus_1; x++)
for (int y = -pcf_size_minus_1; y <= pcf_size_minus_1; y++) {
// Compute coordinate for this PCF sample
vec2 pcf_coord = shadow_map_coord + vec2(x, y) * shadow_map_texel_size;
// Check if the sample is in light or in the shadow
if (light_space_ndc.z <= texture(shadow_map, pcf_coord.xy).x)
lighted_count += 1.0;
}
return lighted_count / num_samples;
}
We now have a loop where we go through the samples in the neighborhood of the texel and average their respective shadow factors. Notice that because we sample the shadow map in texture space [0, 1], we need to consider the size of the shadow map image to properly compute the coordinates for the texels in the neighborhood so the application needs to provide this for every shadow map.
In this post we discussed how to use the shadow map image to produce shadows in the scene, as well as typical issues that can show up with the shadow mapping technique, such as self-shadowing and aliasing, and how to correct them. This will be the last post in this series. There is a lot more stuff to cover about lighting and shadowing, such as Cascaded Shadow Maps (which I introduced briefly in this other post), but I think (or I hope) that this series provides enough material to give anyone interested in the technique a reference for how to implement it.
While I was waiting for review feedback I need on two patches (NIR alpha testing and VC5’s core autotools changes to enable the driver), I had some fun filling out more of the GL driver to reduce the piglit failure rate.
I’ve implemented generic tile lists (which allow multicore rendering), fixed base vertex, added base instance, fixed Z updates with discards, added 16-bit depth buffers, fixed depth and stencil clears, fixed interpolation qualifiers on deprecated color inputs, and started fixing the transform feedback support.
On the VC4 front, Boris has been busy working on the MADVISE ioctl. As with other GL drivers, we have a userspace BO cache, so that you don’t have to page fault in new buffer objects in every frame. However, if we call the new madvise ioctl when putting BOs in the cache, the kernel gets permission to free our buffers if it needs to (for example, when CMA has run out of space for a new X11 window you opened). He’s submitted kernel, userspace code, and tests and had some feedback so far.
Maxime has also started up on the vc4 project, working on getting the Chamelium test environment working on the Raspberry Pi’s HDMI output. I’m hoping Chamelium can help us stabilize HDMI in the long term, but for right now the goal is to use it to build tests for displaying VC4’s tiling formats (particularly the SAND tiling format from the media decode engine).
I also got some good news from Raspberry Pi, that Dave Stevenson has VCSM successfully importing dma-bufs. This is a major step toward displaying or GL rendering from the media decode engine’s buffers.
Finally, I took the next step in merging my MESA_tile_raster_order extension: Submitting a pull request to Khronos to add the spec. Hopefully we’ll be able to merge the code for faster uncomposited X11 soon.
| September 28, 2017 | |
Last year, the AppStream specification gained proper support for adding metadata for fonts, after Richard Hughes did some work on it years ago. We weren’t happy with how fonts were handled at that time, so we searched for better solutions, which is why this took a bit longer to be done. Last year, I was implementing the final support for fonts in both appstream-generator (the metadata extractor used by Debian and a few others) as well as the AppStream specification. This blogpost was sitting on my todo list as a draft for a long time now, and I only just now managed to finish it, so sorry for announcing this so late. Fonts have already been available via AppStream for a year, and this post just sums up the status quo and some neat tricks if you want to write metainfo files for fonts. If you are following AppStream (or the Debian fonts list), you know everything already.
Both Richard and I first tried to extract all the metadata to display fonts in a proper way to the users from the font files directly. This turned out to be very difficult, since font metadata is often wrong or incomplete, and certain desirable bits of metadata (like a longer description) are missing entirely. After messing around with different ways to solve this for days (afterall, by extracting the data from font files directly we would have hundreds of fonts directly available in software centers), I also came to the same conclusion as Richard: The best and easiest solution here is to mandate the availability of metainfo files per font.
Which brings me to the second issue: What is a font? Anyone who knows about fonts will understand one font as one font face, e.g. “Lato Regular Italic” or “Lato Bold”. A user, however, will see the font family as a font, e.g. just “Lato” instead of all the font faces separated out. Since AppStream data is used primarily by software centers, we want something that is easy for users to understand. Hence, an AppStream “font” component really describes a font family or collection of fonts, instead of individual font faces. We also want AppStream data to be useful for system components looking for a specific font, which is why font components will advertise the individual font face names they contain via a
<provides/>-tag. Naming fonts and making them identifiable is a whole other issue; I used a document from Adobe on font naming issues as a rough guideline while working on this.
How to write a good metainfo file for a font is best shown with an example. Lato is a good-looking font family that we want displayed in a software center. So, we write a metainfo file for it and place it in
/usr/share/metainfo/com.latofonts.Lato.metainfo.xml for the AppStream metadata generator to pick up:
<?xml version="1.0" encoding="UTF-8"?>
<component type="font">
<id>com.latofonts.Lato</id>
<metadata_license>FSFAP</metadata_license>
<project_license>OFL-1.1</project_license>
<name>Lato</name>
<summary>A sanserif typeface family</summary>
<description>
<p>
Lato is a sanserif typeface family designed in the Summer 2010 by Warsaw-based designer
Łukasz Dziedzic (“Lato” means “Summer” in Polish). In December 2010 the Lato family
was published under the open-source Open Font License by his foundry tyPoland, with
support from Google.
</p>
</description>
<url type="homepage">http://www.latofonts.com/</url>
<provides>
<font>Lato Regular</font>
<font>Lato Black Italic</font>
<font>Lato Black</font>
<font>Lato Bold Italic</font>
<font>Lato Bold</font>
<font>Lato Hairline Italic</font>
...
</provides>
</component>
When the file is processed, we know that we need to look for fonts in the package it is contained in. So, the appstream-generator will load all the fonts in the package and render example texts for them as an image, so we can show users a preview of the font. It will also use heuristics to render an “icon” for the respective font component using its regular typeface. Of course that is not ideal – what if there are multiple font faces in a package? What if the heuristics fail to detect the right font face to display?
This behavior can be influenced by adding <font/> tags to a <provides/> tag in the metainfo file. The font-provides tags should contain the fullnames of the font faces you want to associate with this font component. If the font file does not define a fullname, the family and style are used instead. That way, someone writing the metainfo file can control which fonts belong to the described component. The metadata generator will also pick the first mentioned font name in the <provides/> list as the one to render the example icon for. It will also sort the example text images in the same order as the fonts are listed in the provides-tag.
The example lines of text are written in a language matching the font using Pango.
But what about symbolic fonts? Or fonts where every heuristic fails? At the moment, we see ugly tofu characters or boxes instead of an actual, useful representation of the font. This brings me to an unofficial extension to font metainfo files that, as far as I know, only appstream-generator supports at the moment. I am not happy enough with this solution to add it to the real specification, but it serves as a good method to fix up the edge cases where we cannot render good example images for fonts. AppStream-Generator supports the FontIconText and FontSampleText custom AppStream properties to allow metainfo file authors to override the default texts and autodetected values. FontIconText will override the characters used to render the icon, while FontSampleText can be a line of text used to render the example images. This is especially useful for symbolic fonts, where the heuristics usually fail and we do not know which glyphs would be representative for a font.
For example, a font with mathematical symbols might want to add the following to its metainfo file:
<custom>
  <value key="FontIconText">∑√</value>
  <value key="FontSampleText">∑ ∮ √ ‖...‖ ⊕ 𝔼 ℕ ⋉</value>
</custom>
Any Unicode glyphs are allowed, but asgen will put some length restrictions on the texts.
So, in summary:

| September 26, 2017 | |
The All Systems Go! 2017 schedule has been published!
I am happy to announce that we have published the All Systems Go! 2017 schedule! We are very happy with the large number and the quality of the submissions we got, and the resulting schedule is exceptionally strong.
Without further ado:
Here's the schedule for the first day (Saturday, 21st of October).
And here's the schedule for the second day (Sunday, 22nd of October).
Here are a couple of keywords from the topics of the talks: 1password, azure, bluetooth, build systems, casync, cgroups, cilium, cockpit, containers, ebpf, flatpak, habitat, IoT, kubernetes, landlock, meson, OCI, rkt, rust, secureboot, skydive, systemd, testing, tor, varlink, virtualization, wifi, and more.
Our speakers are from all across the industry: Chef, CoreOS, Covalent, Facebook, Google, Intel, Kinvolk, Microsoft, Mozilla, Pantheon, Pengutronix, Red Hat, SUSE and more.
For further information about All Systems Go! visit our conference web site.
Make sure to buy your ticket for All Systems Go! 2017 now! A limited number of tickets are left at this point, so make sure you get yours before we are all sold out! Find all details here.
See you in Berlin!
Last week I was at the X Developers Conference, where I gave a talk about the vc4 and vc5 projects (video).
Probably the best result of that talk was a developer mentioning to me that the AMDGPU kernel driver’s execbuf model resolved my concerns about the VC5 submit ioctl I was building: All private buffers are implicitly included in an exec, the kernel tracks a list of evicted buffers to be brought back in (probably empty), and it uses a single reservation object for all the non-shared buffers in the address space, so that there are no O(n) walks in the kernel!
As far as code goes, I’m continuing to make progress on the VC5 Vulkan port. This so far involves copying code from the Intel (“anv”) driver trying to get the symbols all implemented, and replacing the core structures I think I need to (batchbuffers and relocation lists turned into a set of command lists and buffers referenced, but without relocations). I’ve also picked a few bits out of the AMD (“radv”) driver, where they’ve done things I need that Intel didn’t have.
Next up in userspace is to get the VC5 code merged (I’m still blocked on review of the NIR alphatest pass, and the patch where I touch common build system stuff to hook the vc5 driver into it), get bcmv to the point where it links (buffer layout is the next big chunk of code to write), finish off state emission, and then start trying to get my first Vulkan triangle rendered.
| September 21, 2017 | |
HUION PenTablet devices are graphics tablets aimed at artists. These tablets tend to aim for the lower end of the market, and driver support is often somewhere between meh and disappointing. The DIGImend project used to take care of them, but with that out of the picture, the bugs bubble up to userspace more often.
The most common bug at the moment is a lack of proximity events. On pen devices like graphics tablets, we expect a BTN_TOOL_PEN event whenever the pen goes in or out of the detectable range of the tablet ('proximity'). On most devices, proximity does not imply touching the surface (that's BTN_TOUCH or a pressure-based threshold), and on anything that's not built into a screen, proximity without touching the surface is required to position the cursor correctly. libinput relies on proximity events to provide the correct tool state, which in turn is relied upon by compositors and clients.
The broken HUION devices only send BTN_TOOL_PEN once, when the pen first goes into proximity, and then never again until the device is disconnected. To make things more fun, HUION re-uses USB IDs, so we cannot even reliably detect the broken devices and do the usual approach to hardware-quirking. So far, libinput support for HUION devices has thus been spotty. The good news is that libinput git master (and thus libinput 1.9) will have a fix for this. The one thing we can rely on is that tablets keep sending events at the device's scanout frequency. So in libinput we now add a timeout to these tablets and, once it expires, assume proximity-out has happened: libinput fakes a proximity-out event and waits for the next event from the tablet - at which point we'll fake a proximity-in before processing the events. This is enabled on all HUION devices now (re-using USB IDs, remember?) but not on any other device.
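The logic of that quirk can be sketched like this. Note this is only an illustration of the idea, not libinput's actual code: the real fix uses a timer rather than checking timestamps on the next event, and the class name, event strings, and timeout value here are all invented.

```python
PROX_TIMEOUT = 0.05  # seconds; an assumed value, tied to the scanout rate

class ProximityQuirk:
    """Fake proximity events for devices that only send BTN_TOOL_PEN once."""
    def __init__(self):
        self.in_prox = False
        self.last_event_time = None

    def handle_event(self, event, now):
        out = []
        if self.in_prox and now - self.last_event_time > PROX_TIMEOUT:
            # The device went quiet: assume the pen left proximity.
            out.append("fake-prox-out")
            self.in_prox = False
        if not self.in_prox:
            # Broken devices never resend BTN_TOOL_PEN, so synthesize it
            # before forwarding the real event.
            out.append("fake-prox-in")
            self.in_prox = True
        out.append(event)
        self.last_event_time = now
        return out
```

The compositor above this layer then sees well-formed proximity-in/out pairs even though the hardware never sends them, which is exactly the tool state libinput needs to report.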
One down, many more broken devices to go. Yay.
| September 20, 2017 | |
| September 19, 2017 | |
In quite a few blog posts I have been referencing Pipewire, our new Linux infrastructure piece to handle multimedia under Linux better. Well, we are finally ready to formally launch Pipewire as a project and have created a Pipewire website and logo.
To give you all some background, Pipewire is the latest creation of GStreamer co-creator Wim Taymans. The original reason it was created was that we realized that as desktop applications would be moving towards primarily being shipped as containerized Flatpaks, we would need something for video similar to what PulseAudio was doing for audio. As part of his job here at Red Hat, Wim had already been contributing to PulseAudio for a while, including implementing a new security model for PulseAudio to ensure we could securely have containerized applications output sound through PulseAudio. So he set out to write Pipewire, although initially the name he used was PulseVideo. As he was working on figuring out the core design of PipeWire, he came to the conclusion that designing Pipewire to just be able to do video would be a mistake, as a major challenge he was familiar with from working on GStreamer was how to ensure perfect audio and video synchronisation. If both audio and video could be routed through the same media daemon, then ensuring audio and video worked well together would be a lot simpler, and frameworks such as GStreamer would need to do a lot less heavy lifting to make it work. So just before we started sharing the code publicly, we renamed the project to Pinos, named after Pinos de Alhaurín, a small town close to where Wim is living in southern Spain. In retrospect Pinos was probably not the world's best name :)
Anyway as work progressed Wim decided to also take a look at Jack, as supporting the pro-audio usecase was an area PulseAudio had never tried to do, yet we felt that if we could ensure Pipewire supported the pro-audio usecase in addition to consumer level audio and video it would improve our multimedia infrastructure significantly and ensure pro-audio became a first class citizen on the Linux desktop. Of course as the scope grew the development time got longer too.
Another major usecase for Pipewire for us was that we knew that with the migration to Wayland we would need a new mechanism to handle screen capture, as the way it was done under X was very insecure. So Jonas Ådahl started working on creating an API we could support in the compositor and use Pipewire to output. This is meant to cover everything from single-frame capture like screenshots, to local desktop recording and remoting protocols. It is important to note here that what we have done is not just implement support for a specific protocol like RDP or VNC; we ensured there is an advanced infrastructure in place to support any protocol on top of. For instance, we will be working with the Spice team here at Red Hat to ensure SPICE can take advantage of Pipewire and the new API. We will also ensure Chrome and Firefox support this so that you can share your Wayland desktop through systems such as Blue Jeans.
Where we are now
So after multiple years of development we are now landing Pipewire in Fedora Workstation 27. This initial version is video only as that is the most urgent thing we need supported for Flatpaks and Wayland. So audio is completely unaffected by this for now and rolling that out will require quite a bit of work as we do not want to risk breaking audio on your system as a result of this change. We know that for many the original rollout of PulseAudio was painful and we do not want a repeat of that history.
So I strongly recommend grabbing the Fedora Workstation 27 beta to test pipewire and check out the new website at Pipewire.org and the initial documentation at the Pipewire wiki. Especially interesting is probably the pages that will eventually outline our plans for handling PulseAudio and JACK usecases.
If you are interested in Pipewire please join us on IRC in #pipewire on freenode. Also if things goes as planned Wim will be on Linux Unplugged tonight talking to Chris Fisher and the Unplugged crew about Pipewire, so tune in!
| September 14, 2017 | |
| September 12, 2017 | |
So our graphics team is looking for a new Senior Software Engineer to help with our AMD GPU support, including GPU compute. This is a great opportunity to join a versatile and top notch development team who plays a crucial role in making sure Linux has up-to-date and working graphics support and who are deeply involved with most major new developments in Linux graphics.
Also, as a piece of advice when you read the job advertisement: remember that it is very rare anyone can tick all the boxes in the requirement list, so don’t hesitate to apply just because you don’t fit the description and requirements perfectly. For example, even if you are more junior in terms of years, you could still be a great candidate if you, for instance, participated in GPU-related Google Summer of Code projects or contributed as a community member. And for this position we are open to candidates from around the globe interested in working remotely, although as always, if you are willing or interested in joining one of our development offices in Boston (USA), Brisbane (Australia) or Brno (Czech Republic), that is a plus of course.
So please check out the job advertisement for Senior Software Engineer and see if it could be your chance to join the world's premier open source company.
| September 10, 2017 | |
If you're curious about the slides, you can download the PDF or the OTP.
This post has been a part of work undertaken by my employer Collabora.
I would like to thank the wonderful organizers of Open Source Summit NA, for hosting a great community event.
| September 09, 2017 | |
| September 05, 2017 | |
This week I worked on stabilizing the VC5 DRM. I got the MMU support almost working – rendering triangles, but eventually crashing the kernel, and then decided to simplify my plan a bit. First, a bit of background on DRM memory management:
One of the things I’m excited about for VC5 is having each process have a separate address space. It has always felt wrong to me on older desktop chips that we would let any DRI client read/write the contents of other DRI clients’ memory. We just didn’t have any hardware support to help us protect them, without effectively copying out and zeroing the other clients’ on-GPU memory. With i915 we gained page tables that we could swap out, at least, but we didn’t improve the DRM to do this for a long time (I’m actually not entirely sure if we even do so now). One of our concerns was that it could be a cost increase when switching between clients.
However, one of the benefits of having each process have a separate address space is that then we can give the client addresses for its buffer that it can always rely on. Instead of each reference to a buffer needing to be tracked so that kernel (or userspace, with new i915 ABI) can update them when the buffer address changes, we just keep track of the list of all buffers that have been referenced and need to be at their given offsets. This should be a win for CPU overhead.
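The contrast between the two submit models can be sketched in a few lines. The structures here are invented for illustration (the real ioctls carry much more state); the point is that the fixed-address model replaces per-reference patching with a simple list of buffers that must be resident at their known offsets.

```python
def submit_with_relocations(cmds, relocs, bo_offsets):
    """Old model: O(n) in the number of references -- patch each one.

    relocs is a list of (command-stream index, buffer name) pairs that the
    kernel (or userspace) must rewrite whenever a buffer's address changes.
    """
    for index, bo in relocs:
        cmds[index] = bo_offsets[bo]
    return cmds

def submit_fixed_addresses(referenced_bos):
    """New model: addresses in the command stream are already final.

    With a per-process address space, the kernel only needs the set of
    buffers that must be bound at their agreed-upon offsets.
    """
    return sorted(set(referenced_bos))
```

The relocation walk disappears entirely from the hot path, which is where the CPU-overhead win comes from.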
What I built at first was each VC5 DRM fd having its own page table and address space that it would bind buffers into. The problem was that the page tables are expensive (up to 4MB of contiguous memory), and so I’d need to keep the page tables separate from the address spaces so that I could allocate them on demand.
After a bit of hacking on that plan, I decided to simplify things for now. Given that today’s boards have less than 4 GB of memory, I could have all the clients share the same page tables for a 4GB address space and use the GMP (an 8kb bitmask for 128kb-granularity access control) to mask each client’s access to the address space. With a shared page table and the GMP disabled I soon got this stable enough to do basic piglit texturing tests reliably. The GMP plan should also reduce our context switch overhead, by replacing the small 8kb GMP memory instead of flushing all the MMU caches.
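The arithmetic behind that plan works out neatly: an 8kb mask at 128kb granularity over a 4GB address space is two bits (read and write) per region. The sketch below only illustrates that arithmetic; the bit layout and function names are invented, not the actual GMP register format.

```python
GRANULE = 128 * 1024                          # 128kb access-control granularity
REGIONS = (4 * 1024 ** 3) // GRANULE          # 32768 regions cover 4GB
BITS_PER_REGION = (8 * 1024 * 8) // REGIONS   # 8kb of mask -> 2 bits per region

def allow_range(gmp, start, size):
    """Grant read+write on every granule overlapping [start, start + size)."""
    first = start // GRANULE
    last = (start + size - 1) // GRANULE
    for region in range(first, last + 1):
        byte, shift = divmod(region * BITS_PER_REGION, 8)
        gmp[byte] |= 0b11 << shift            # set both the R and W bits
    return gmp

gmp = bytearray(8 * 1024)                     # all access denied by default
allow_range(gmp, 0x20000, 0x40000)            # a hypothetical client allocation
```

Swapping an 8kb table per client on context switch is a far smaller cost than rewriting page tables and flushing MMU caches, which is where the context-switch win comes from.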
Somewhere down the line, if we find we need 4GB per client, we can build kernel support to have clients that exhaust the shared table get kicked out to their own page tables.
Next up for the kernel module: GPU reset (so I can start running piglit tests and settling my V3D DT binding), and filling out those GMP bitmasks.
The last couple of days of the week were spent on forking the anv driver to start building a VC5 Vulkan driver (“bcmv”). I’ll talk about it soon (next week is my bike trip, so no TWIV then, and after that is XDC 2017).
| August 31, 2017 | |
I would like to thank the wonderful organizers, GDG Berlin Android, for hosting a great community event.
If you're curious about the slides, you can download the PDF or the OTP.
| August 29, 2017 | |
The All Systems Go! 2017 Call for Participation is Closing on September 3rd!
Please make sure to get your presentation proposals for All Systems Go! 2017 in now! The CfP closes on Sunday!
In case you haven't heard about All Systems Go! yet, here's a quick reminder what kind of conference it is, and why you should attend and speak there:
All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward. All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd. All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.
In particular, we are looking for sessions including, but not limited to, the following topics:
While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.
To submit your proposal now please visit our CFP submission web site.
For further information about All Systems Go! visit our conference web site.
systemd.conf will not take place this year, in favour of All Systems Go!. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!
| August 28, 2017 | |
This week was fairly focused on getting my BCM7268 board up and running and building the kernel module for it. I first got the board booting to an NFS root on an upstream kernel – huge thanks to the STB team for the upstreaming work they’ve done, so that I could just TFTP load an upstream kernel and start working.
The second half of the week was building the DRM module. I started with copying from vc4, since our execution model is very similar. We have an MMU now, so I replaced the CMA helpers with normal shmem file allocation, and started building bits we’ll need to have a page table per DRM file descriptor. I also ripped out the display components – we’re 3D only now, and any display hardware would be in a separate kernel module and communicate through dma-bufs. Finally, I put together a register header, debugfs register decode, and part of the interrupt handlers. Next week it’s going to be getting our first interrupts handled and filling out the MMU, then I should be on to trying to actually paint some pixels!
| August 27, 2017 | |
| A flamegraph from the shader-db run, since every blog post needs a catchy picture. |
| August 25, 2017 | |
In this last week of my GSoC project I aimed at bringing my code into its final form. The goals for that are simple:
Regarding the remaining issues in my test scenarios: there are still some problems with some of my test applications. As long as the underlying problem isn’t identified, the question is always whether my code is the problem or the application is not working correctly. For example, with KWin’s Wayland session there is a segmentation fault at the end of its run time whenever, at some point during run time, an Xwayland-based client did a flip on a sub-surface. But according to GDB, the segmentation fault is caused by KWin’s own EGL context destruction and happens somewhere in the depths of the EGL DRI2 implementation in Mesa.
I tried to find the reasons for some difficult issues like this one in the last few weeks, but at some point, for a few of them, I had to admit that either the tools I’m using aren’t capable enough to point me in the right direction or I’m simply not knowledgeable enough to know where to look. I hope that after I have sent in the patches in the next few days, I will get some helpful advice from senior developers who know the code better than I do, and I can then solve the remaining problems.
The patches I will send to the mailing list in the next days are dependent on each other. As an overview here is how I planned them:
This partitioning will hopefully help the reviewers. The other advantage is that mistakes I may have made in one of these patches and have been overlooked in the review process might be easier to get found afterwards. Splitting my code changes up into smaller units also gives me the opportunity to look into the structure of my code from a different perspective one more time and fix details I may have overlooked until now.
I hope to be able to send the patches in tomorrow. I’m not sure if I’m supposed to write one more post next week after the GSoC project has officially ended, but in any case I plan on writing one more post at some point in the future about the reaction to my patches and if and how they got merged in the end.
| August 23, 2017 | |
| August 21, 2017 | |
This week I finished fixing regressions from the VC5 QPU instruction scheduler, and polished the vc5 series up so I could post it for review in Mesa. I landed a few initial bits that wouldn’t affect anyone else.
I also took another look at my DSI series. I had previously tried to work around the boot time dependency circle by letting the DSI device be created before the DSI host showed up. That proved to be fragile as well (in particular it would have had issues if the host had to -EPROBE_DEFER for some other reason), so I went back and made VC4 advertise a DSI host before any of the rest of the DSI encoder was ready. This proved to be not very hard, and I’m hoping once again that this is the final version of the series.
Other bits this week:
| August 18, 2017 | |
The last week of GSoC 2017 is about to begin. My project is in a pretty good state I would say: I have created a big solution for the Xwayland Present support, which is integrated firmly and not just attached to the main code path like an afterthought. But there are still some issues to sort out. Especially the correct cleanup of objects is difficult. That’s only a problem with sub-surfaces though. So, if I’m not able to solve these issues in the next few days I’ll just allow full window flips. This would still include all full screen windows and for example also the Steam client as it’s directly rendering its full windows without usage of the compositor.
I still hope though to solve the last problems with the sub-surfaces, since this would mean that we can in all direct rendering cases on Wayland use buffer flips, which would be a huge improvement compared to native X.
In any case at first I’ll send the final patch for the Present extension to the xorg-devel mailing list. This patch will add a separate mode for flips per window to the extension code. After that I’ll send the patches for Xwayland, either with or without sub-surface support.
That’s already all for now as a last update before the end and with still a big decision to be made. In one week I can report back on what I chose and how the final code looks like.
| August 16, 2017 | |
I'm currently at DebConf 17 in Montréal, back at DebConf for the first time in 10 years (last time was DebConf 7 in Edinburgh). It's great to put names to faces and meet more of my co-developers in person!
On Monday I gave a talk entitled “A Debian maintainer's guide to Flatpak”, aiming to introduce Debian developers to Flatpak, and show how Flatpak and Debian (and Debian derivatives like SteamOS) can help each other. It seems to have been quite well received, with people generally positive about the idea of using Flatpak to deliver backports and faster-moving leaf packages (games!) onto the stable base platform that Debian is so good at providing.
A video of the talk is available from the Debian Meetings Archive. I've also put up my slides in the DebConf git-annex repository, with some small edits to link to more source code: A Debian maintainer's guide to Flatpak. Source code for the slides is also available from Collabora's git server.
The next step is to take my proof-of-concept for building Flatpak runtimes and apps from Debian and SteamOS packages, flatdeb, get it a bit more production-ready, and perhaps start publishing some sample runtimes from a cron job on a Debian or Collabora server. (By the way, if you downloaded that source right after my talk, please update - I've now pushed some late changes that were necessary to fix the 3D drivers for my OpenArena demo.)
I don't think Debian will be going quite as far as Endless any time soon: as Cosimo outlined in the talk right before mine, they deploy their Debian derivative as an immutable base OS with libOSTree, with all the user-installable modules above that coming from Flatpak. That model is certainly an interesting thing to think about for Debian derivatives, though: at Collabora we work on a lot of appliance-like embedded Debian derivatives, with a lot of flexibility during development but very limited state on deployed systems, and Endless' approach seems a perfect fit for those situations.
[Edited 2017-08-16 to fix the link for the slides, and add links for the video]
| August 14, 2017 | |
This week was spent almost entirely on the VC5 QPU instruction scheduler. This is what packs together the ADD and MUL instruction components and the signal flags (LDUNIF, LDVPM, LDTMU, THRSW, etc.) into the final sequence of 64-bit instructions.
I based it on the VC4 scheduler, which in turn is based on i965’s. Being the 5th of these I’ve worked on, it would sure be nice if it felt like less copy and paste, but it’s almost all very machine-dependent code and I haven’t come up with a way to reduce the duplication.
The initial results are great:
instructions in affected programs: 116269 -> 71677 (-38.35%)
but I’ve still got a handful of regressions to fix.
Next up for scheduling is to fill the thrsw and branch delay slots. I also need to pack more than 2 things into one QPU instruction – right now we pick two of ADD, MUL, LDUNIF, and LDVPM, but we could do an ADD, MUL, LDVPM, and LDUNIF all together. That should be a small change from here.
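The packing side of this kind of scheduler can be illustrated with a toy sketch: greedily merge pending operations into an instruction as long as its per-unit slots are free. This is only the packing idea with invented op names; the real scheduler also weighs latencies, dependencies, and register conflicts before merging anything.

```python
def pack_instructions(ops):
    """Pack a list of (unit, name) ops, unit in {'add', 'mul', 'sig'},
    into instructions where each unit slot can be filled at most once."""
    packed = []
    pending = list(ops)
    while pending:
        instr = {"add": None, "mul": None, "sig": None}
        remaining = []
        for unit, name in pending:
            if instr[unit] is None:
                instr[unit] = name              # slot free: merge it in
            else:
                remaining.append((unit, name))  # defer to a later instruction
        packed.append(instr)
        pending = remaining
    return packed
```

Packing four ops whose units only collide once yields two instructions instead of four, which is the same mechanism behind the 38% instruction-count drop quoted above, just without any of the correctness constraints.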
Other bits this week:
| August 11, 2017 | |
One more time I decided to start from the beginning and try another even more radical approach to my Xwayland GSoC project than the last time. I have now basically written a full API inside the Present extension, with which modes of presentation can be added. There are of course only two modes right now: The default screen presenting mode as how it worked until now and the new one for Xwayland to present on individual windows and without the need of them being full screen. While this was also possible with the version from last week, the code is now substantially better structured.
I’m currently still in a phase of testing so I won’t write much more about it for now. Instead I want to talk about one very persistent bug, which popped up seemingly from one day to the other in my KWin session and hindered my work immensely.
This is also a tale of how difficult it can be to find the exact reason for a bug, especially when there is not much information to work with: As said, the problem came out of nowhere. I had used Neverball to test my Xwayland code in the past all the time. But starting a few days ago, whenever I selected a level and the camera was about to pan down to it, the whole KWin session blocked and I could only hard-reboot the computer or SIGKILL the KWin process via SSH. The image of the level was frozen and keyboard input didn’t work anymore. That said, I still heard the Neverball music playing in the background, so the application itself wasn’t the problem. And Xwayland and KWin didn’t quit with a segfault; they just stopped doing stuff.
So I began the search. Of course I first suspected my own code to be the problem, but when I tried the Xwayland master branch I experienced the same problem. But what was going on? Why did Neverball suddenly not work at all anymore? I had used it all the time in the past, and now everything blocked? So I first tried rolling back commits from the last few weeks in the master branches of Xwayland, KWin, and KWayland, thinking that the problem must have been introduced at that point in time, because I had still been able to use Neverball without problems just recently, right?
But the problem was still there. So I went further back. It worked with my distribution’s Xwayland, and by manually testing through the input-related commits to Xwayland, starting from somewhere at the beginning of the year, I finally found the responsible commit, or so I thought. Before that commit no blocking, with it blocking, so there had to be an error in this particular commit, right? But on the other hand, why had I been able to use Neverball without blocking just one week ago, when this commit was several months old?
Nevertheless I studied the Xwayland input code thoroughly. The documentation for this stuff is non-existent and the function patterns are confusing, but with time I understood it well enough to decide that this couldn’t be the cause of the problem. Another indicator was that Weston worked fine; the problem had to be somewhere in KWin or KWayland. After lots of time I also finally understood why I had still been able to use Neverball a few days earlier but now not at all: previously I had always started KWin from a terminal, without launching a full Wayland Plasma session. Once everything worked fine there, I switched to testing in the Plasma session and apparently missed that from then on the problem existed. So was it Plasma itself? But that wasn’t really possible, since Plasma is a different process.
I didn’t want to give up, so I looked through the KWayland and KWin code related to pointer locking and confinement, which is a lot. Hours later I finally found the root cause: KWin creates small on-screen notifications when a pointer is locked or confined to a window. Most of the time this works without problems, but with the above patch to Xwayland the client sends the pointer confine and lock requests to KWin in quick succession, and for some reason, when trying to show both notifications at the same time, KWin or maybe the QML engine for the notification can’t proceed any further. Without the patch, Xwayland only ever sent the confinement request, and nothing blocked. I don’t know how Martin would like this issue solved, so I created a bug report for now. It’s weird that in the end it was such a petty cause with such huge consequences, but that’s how it goes.
planet.freedesktop.org is powered by Venus, and the freedesktop.org community.