planet.freedesktop.org
 July 04, 2015
So, I realized it has been a while since posting about freedreno progress, so in honor of US independence day I figured it was as good an excuse as any for an update about independence from gpu blob driver for snapdragon/adreno..

Back in end of March 2015 at ELC, I gave a freedreno update presentation at ELC, listing the following major tasks left for gles3 support:
• Uniform Buffer Objects (UBO)
• Transform Feedback (TF)
• Multi-Render-Target (MRT)
• Multisample anti-aliasing (MSAA)
• NV_conditional_render
• 32b depth (z32 and z32_s8) (which I forgot to mention in the presentation)
EDIT: Ilia pointed out that 32b depth is needed for gles3 too, and gl3 additionally needs clipdist/etc (which we'll have to emulate, but hopefully can do in a generic nir pass) and rgtc (which will need sw decompression hopefully in mesa core so other drivers for gles class hw can reuse).  Original list was based on what mesa's compute_version() code was checking quite some time back.

Since then, we've gained support for UBO's (a3xx by Ilia Mirkin, and a4xx), MRT (for a3xx and core, again thanks to Ilia.. still needs to be wired up for a4xx), 32b depth (a3xx and core, again thanks to Ilia), and I've finished up shader compiler for loops/flow-control for ir3 (a3xx/a4xx).  The shader compiler work was a somewhat larger task than I expected (and I did expect it to be a lot of work), but it also involved moving over to NIR, in addition to re-writing the scheduler and register allocation passes, as well as a lot of re-org to ir3 in order to support multiple basic blocks.  The move to NIR was not strictly required, but it brings a lot of benefits in the form of shared support for conversion to SSA, scalarizing, CSE, DCE, constant folding, and algebraic optimizations.  And I figured it was less work in the long run to move to NIR first and drop the TGSI frontend, before doing all the refactoring needed to support loops and non-lowerable flow-control.  Incidentally, the compiler work should make the shader-compiler part of TF easier (since we need to generate a conditional write to TF buffer iff not overwriting past the end of the TF buffer).

In the mean time, freedreno and drm/msm have also gained support for the a306 gpu found in the new dragonboard 410c.  This board is a nice new low cost ($75) snapdragon community board based on the 64bit snapdragon 410. And thanks to a lot of work by linaro and qualcomm, the upstream kernel situation for this board is looking pretty good. It is shipping initially with a 4.0 based kernel (with patches on top for stuff that hadn't yet been merged for 4.0, including a lot of stuff backported from 4.1 and 4.2), including gpu/display/audio/video-codec/etc. I believe that the 4.1 kernel was the first version where a vanilla kernel could boot on db410c with basic stuff (like serial console) working. The kernel support for the gpu and display, other than the adv7533 hdmi bridge chip) landed in 4.2. There is still more work to get *everything* (including audio, vidc, etc) merged upstream, but work continues in that direction, making this quite an exciting board. Also, we have a GSoC student, Varad, working on freedreno support for android. It is still in early stages, with some debugging still to do, but he has made a lot of progress and things are starting to work. And since no blog post is complete without some nice screenshots... the other day someone pointed me at a post in the dolphin forums about how dolphin was running on a420 (same device as in the ifc6540). We all had a good laugh about the rendering issues with the blob driver. But, since dolphin was the first gl3 game that worked with freedreno, I was curious how freedreno would do.. so I fired up the ifc6540 and replayed some dolphin fifo logs that would let me render approximately the same scenes: Yoshi looks to be rendering pretty well.. digimon has a bit of corruption, but no where near as bad as the blob driver. I suspect the issue with digimon is an instruction scheduling issue in the shader compiler (well, no rest for the gpu driver writers), but nice to see that it is already in pretty good shape. Now we just need steam store or some unigine demos for arm linux :-P In type-theoretic terms, Monte has a very boring type system. All objects expressible in Monte form a set, Mont, which has some properties, but not anything interesting from a theoretical point of view. I plan to talk about Mont later, but for now we'll just consider it to be a way for me to make existential or universal claims about Monte's object model. Let's start with guards. Guards are one of the most important parts of writing idiomatic Monte, and they're also definitely an integral part of Monte's safety and security guarantees. They look like types, but are they actually useful as part of a type system? Let's consider the following switch expression: switch (x): match c :Char: "It's a character" match i :Int: "It's an integer" match _: "I don't know what it is!"  The two guards, Char and Int, perform what amounts to a type discrimination. We might have an intuition that if x were to pass Char, then it would not pass Int, and vice versa; we might also have an intuition that the order of checking Char and Int does not matter. I'm going to formalize these and show how strong they can be in Monte. When a coercion happens, the object being coerced is called the specimen. The result of the coercion is called the prize. You've already been introduced to the guard, the object which is performing the coercion. It happens that a specimen might override a Miranda method, _conformTo/1, in order to pass guards that it cannot normally pass. We call all such specimens conforming. All specimens that pass a guard also conform to it, but some non-passing specimens might still be able to conform by yielding a prize to the guard. Here's an axiom of guards: For all objects in Mont, if some object specimen conforms to a guard G, and def prize := G.coerce(specimen, _), then prize passes G. This cannot be proven by any sort of runtime assertion (yet?), but any guard that does not obey this axiom is faulty. One expects that a prize returned from a coercion passes the guard that was performing the coercion; without this assumption, it would be quite foolhardy to trust any guard at all! With that in mind, let's talk about properties of guards. One useful property is idempotence. An idempotent guard G is one that, for all objects in Mont which pass G, any such object specimen has the equality G.coerce(specimen, _) == specimen. (Monte's equality, if you're unfamiliar with it, considers two objects to be equal if they cannot be distinguished by any response to any message sent at them. I could probably craft equivalency classes out of that rule at some point in the future.) Why is idempotency good? Well, it formalizes the intuition that objects aren't altered when coerced if they're already "of the right type of object." I expect that if I pass 42 to a function that has the pattern x :Int, I might reasonably expect that x will get 42 bound to it, and not 420 or some other wrong number. Monte's handling of state is impure. This complicates things. Since an object's internal state can vary, its willingness to respond to messages can vary. Let's be more precise in our definition of passing coercion. An object specimen passes coercion by a guard G if, for some combination of specimen and G internal states, G.coerce(specimen, _) == specimen. If specimen passes for all possible combinations of specimen and G internal states, then we say that specimen always passes coercion by G. (And if specimen cannot pass coercion with any possible combination of states, then it never passes.) Now we can get to retractability. A idempotent guard G is unretractable if, for all objects in Mont which pass coercion by G, those objects always pass coercion by G. The converse property, that it's possible for some object to pass but not always pass coercion, would make G retractable. An unretractable guard provides a very comfortable improvement over an idempotent one, similar to dipping your objects in DeepFrozen. I think that most of the interesting possibilities for guards come from unretractable guards. Most of the builtin guards are unretractable, too; data guards like Double and Str are good examples. Theorem: An unretractable guard G partitions Mont into two disjoint subsets whose members always pass or never pass coercion by G, respectively. The proof is pretty trivial. This theorem lets us formalize the notion of a guard as protecting a section of code from unacceptable values; if Char is unretractable (and it is!), then a value guarded by Char is always going to be a character and never anything else. This theorem also gives us our first stab at a type declaration, where we might say something like "An object is of type Char if it passes Char." Now let's go back to the beginning. We want to know how Char and Int interact. So, let's define some operations analagous to set union and intersection. The union of two unretractable guards G and H is written Any[G, H] and is defined as an unretractable guard that partitions Mont into the union of the two sets of objects that always pass G or H respectively, and all other objects. A similar definition can be created for the intersection of G and H, written All[G, H] and creating a similar partition with the intersection of the always-passing sets. Both union and intersection are semigroups on the set of unretractable guards. (I haven't picked a name for this set yet. Maybe Mont-UG?) We can add in identity elements to get monoids. For union, we can use the hypothetical guard None, which refuses to pass any object in Mont, and for intersection, the completely real guard Any can be used. object None: to coerce(_, ej): throw(ej, "None shall pass")  It gets better. The operations are also closed over Mont-UG, and it's possible to construct an inverse of any unretractable guard which is also an unretractable guard: def invertUG(ug): return object invertedUG: to coerce(specimen, ej): escape innerEj: ug.coerce(specimen, innerEj) throw(ej, "Inverted") catch _: return specimen  This means that we have groups! Two lovely groups. They're both Abelian, too. Exciting stuff. And, in the big payoff of the day, we get two rings on Mont-UG, depending on whether you want to have union or intersection as your addition or multiplication. This empowers a programmer, informally, to intuit that if Char and Int are disjoint (and, in this case, they are), then it might not matter in which order they are placed into the switch expression. That's all for now!  June 30, 2015 So this will be the first in a series of blogs talking about some major initiatives we are doing for Fedora Workstation. Today I want to present and talk about a thing we call Pinos. So what is Pinos? One of the original goals of Pinos was to provide the same level of advanced hardware handling for Video that PulseAudio provides for Audio. For those of you who has been around for a while you might remember how you once upon a time could only have one application using the sound card at the same time until PulseAudio properly fixed that. Well Pinos will allow you to share your video camera between multiple applications and also provide an easy to use API to do so. Video providers and consumers are implemented as separate processes communicating with DBUS and exchanging video frames using fd passing. Some features of Pinos • Easier switching of cameras in your applications • It will also allow you to more easily allow applications to switch between multiple cameras or mix the content from multiple sources. • Multiple types of video inputs • Supports more than cameras. Pinos also supports other type of video sources, for instance it can support your desktop as a video source. • GStreamer integration • Pinos is built using GStreamer and also have GStreamer elements supporting it to make integrating it into GStreamer applications simple and straightforward. • Pinos got some audio support • Well it tries to solve some of the same issues for video that PulseAudio solves for audio. Namely letting you have multiple applications sharing the same camera hardware. Pinos does also include audio support in order to let you handle both. What do we want to do with this in Fedora Workstation? • One thing we know is of great use and importance for many of our users, including many developers who wants to make videos demonstrating their software, is to have better screen capture support. One of the test cases we are using for Pinos is to improve the built in screen casting capabilities of GNOME 3, the goal being to reducing overhead and to allow for easy setup of picture in picture capturing. So you can easily set it up so there will be a camera capturing your face and voice and mixing that into your screen recording. • Video support for Desktop Sandboxes. We have been working for a while on providing technology for sandboxing your desktop applications and while we with a little work can use PulseAudio for giving the sandboxed applications audio access we needed something similar for video. Pinos provides us with such a solution. Who is working on this? Pinos is being designed and written by Wim Taymans who is the co-creator of the GStreamer multimedia framework and also a regular contributor to the PulseAudio project. Wim is also the working for Red Hat as a Principal Engineer, being in charge of a lot of our multimedia support in both Red Hat Enterprise Linux and Fedora. It is also worth nothing that it draws many of its ideas from an early prototype by William Manley called PulseVideo and builds upon some of the code that was merged into GStreamer due to that effort. Where can I get the code? The code is currently hosteed in Wim’s private repository on freedesktop. You can get it at cgit.freedesktop.org/~wtay/pinos. How can I get involved or talk to the author You can find Wim on Freenode IRC, he uses the name wtay and hangs out in both the #gstreamer and #pulseaudio IRC channels. Once the project is a bit further along we will get some basic web presence set up and a mailing list created. FAQ If Pinos contains Audio support will it eventually replace PulseAudio too? Probably not, the usecases and goals for the two systems are somewhat different and it is not clear that trying to make Pinos accommodate all the PulseAudio usescases would be worth the effort or possible withour feature loss. So while there is always a temptation to think ‘hey, wouldn’t it be nice to have one system that can handle everything’ we are at this point unconvinced that the gain outweighs the pain. Will Pinos offer re-directing kernel APIs for video devices like PulseAudio does for Audio? In order to handle legacy applications? No, that was possible due to the way ALSA worked, but V4L2 doesn’t have such capabilities and thus we can not take advantage of them. Why the name Pinos? The code name for the project was PulseVideo, but to avoid confusion with the PulseAudio project and avoid people making to many assumptions based on the name we decided to follow in the tradition of Wayland and Weston and take inspiration from local place names related to the creator. So since Wim lives in Pinos de Alhaurin close to Malaga in Spain we decided to call the project Pinos. Pinos is the word for pines in Spanish  June 26, 2015 The 4.1 kernel release is still a few weeks off and hence a bit early to talk about 4.2. But the drm subsystem feature cut-off already passed and I'm going on vacation for 2 weeks, so here we go. First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.2. There has also been other feature work going on on the modeset side: Ville cleaned&fixed up the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence, maybe this will finally fix the remaining "DP port stuck" issues we still seem to have. Looking at newer platforms the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms skl now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which have been enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for atom chips was already shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label for the i915 driver. There's also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forward requests and handing back results. Mika Kahola has optimized the DP link training, the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work, unfortunately it's still not yet enabled by default. And finally there's been lots of cleanups and improvements under the hood all over, as usual. A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release. Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We've added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and ever since then paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual enviroments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.  June 25, 2015 One of the bits we are currently finalising in libinput are touchpad gestures. Gestures on a normal touchscreens are left to the compositor and, in extension, to the client applications. Touchpad gestures are notably different though, they are bound to the location of the pointer or the keyboard focus (depending on the context) and they are less context-sensitive. Two fingers moving together on a touchscreen may be two windows being moved at the same time. On a touchpad however this is always a pinch. Touchpad gestures are a lot more hardware-sensitive than touchscreens where we can just forward the touch points directly. On a touchpad we may have to consider software buttons or just HW-limitations of the touchpad. This prevents the implementation of touchpad gestures in a higher level - only libinput is aware of the location, size, etc. of software buttons. Hence - touchpad gestures in libinput. The tree is currently sitting here and is being rebased as we go along, but we're expecting to merge this into master soon. The interface itself is fairly simple: any device that may send gestures will have the LIBINPUT_DEVICE_CAP_GESTURE capability set. This is currently only implemented for touchpads but there is the potential to support this on other devices too. Two gestures are supported: swipe and pinch (+rotate). Both come with a finger count and both follow a Start/Update/End cycle. Gestures have a finger count that remains the same for the gestures, so if you switch from a two-finger pinch to a three-finger pinch you will see one gesture end and the next one start. Note that how to deal with this is up to the caller - it may very well consider this the same gesture semantically. Swipe gestures have delta coordinates (horizontally and vertically) of the logical center of the gesture, compared to the previous event. A pinch gesture has the delta coordinates too and a delta angle (clockwise, in degrees). A pinch gesture also has the notion of an absolute scale, the Begin event always has a scale of 1.0 and that changes as the fingers move towards each other further apart. A scale of 2.0 means they're now twice as far apart as originally. Nothing overly exciting really, it's a simple API that provides a couple of basic elements of data. Once integrated into the desktop properly, it should provide for some improved navigation. OS X has had this for a log time now and it's only time we caught up.  June 20, 2015 You’re a developer and you know AF_UNIX? You used it occasionally in your code, you know how high-level IPC puts marshaling on top and generally have a confident feeling when talking about it? But you actually have no clue what this fancy new kdbus is really about? During discussions you just nod along and hope nobody notices? Good. This is how it should be! As long as you don’t work on IPC libraries, there’s absolutely no requirement for you to have any idea what kdbus is. But as you’re reading this, I assume you’re curious and want to know more. So lets pick you up at AF_UNIX and look at a simple example. ## AF_UNIX Imagine a handful of processes that need to talk to each other. You have two options: Either you create a separate socket-pair between each two processes, or you create just one socket per process and make sure you can address all others via this socket. The first option will cause a quadratic growth of sockets and blows up if you raise the number of processes. Hence, we choose the latter, so our socket allocation looks like this: int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0); Simple. Now we have to make sure the socket has a name and others can find it. We choose to not pollute the file-system but rather use the managed abstract namespace. As we don’t care for the exact names right now, we just let the kernel choose one. Furthermore, we enable credential-transmission so we can recognize peers that we get messages from: struct sockaddr_un address = { .sun_family = AF_UNIX }; int enable = 1; setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &enable, sizeof(enable)); bind(fd, (struct sockaddr*)&address, sizeof(address.sun_family)); By omitting the sun_path part of the address, we tell the kernel to pick one itself. This was easy. Now we’re ready to go so lets see how we can send a message to a peer. For simplicity, we assume we know the address of the peer and it’s stored in destination. struct sockaddr_un destination = { .sun_family = AF_UNIX, .sun_path = "..." }; sendto(fd, "foobar", 7, MSG_NOSIGNAL, (struct sockaddr*)&destination, sizeof(destination)); …and that’s all that is needed to send our message to the selected destination. On the receiver’s side, we call into recvmsg to receive the first message from our queue. We cannot use recvfrom as we want to fetch the credentials, too. Furthermore, we also cannot know how big the message is, so we query the kernel first and allocate a suitable buffer. This could be avoided, if we knew the maximum package size. But lets be thorough and support unlimited package sizes. Also note that recvmsg will return any next queued message. We cannot know the sender beforehand, so we also pass a buffer to store the address of the sender of this message: char control[CMSG_SPACE(sizeof(struct ucred))]; struct sockaddr_un sender = {}; struct ucred creds = {}; struct msghdr msg = {}; struct iovec iov = {}; struct cmsghdr *cmsg; char *message; ssize_t l; int size; ioctl(fd, SIOCINQ, &size); message = malloc(size + 1); iov.iov_base = message; iov.iov_len = size; msg.msg_name = (struct sockaddr*)&sender; msg.msg_namelen = sizeof(sender); msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_control = control; msg.msg_controllen = sizeof(control); l = recvmsg(fd, msg, MSG_CMSG_CLOEXEC); for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_CREDENTIALS) memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds)); } printf("Message: %s (length: %zd uid: %u sender: %s)\n", message, l, creds.uid, sender.sun_path + 1); free(message); That’s it. With this in place, we can easily send arbitrary messages between our peers. We have no length restriction, we can identify the peers reliably and we’re not limited by any marshaling. Sure, we now dropped error-handling, event-loop integration and ignored some nasty corner cases, but that can all be solved. The code stays mostly the same. Congratulations! You now understand kdbus. In kdbus: • socket(AF_UNIX, SOCK_DGRAM, 0) becomes open(“/sys/fs/kdbus/….”, …) • bind(fd, …, …) becomes ioctl(fd, KDBUS_CMD_HELLO, …) • sendto(fd, …) becomes ioctl(fd, KDBUS_CMD_SEND, …) • recvmsg(fd, …) becomes ioctl(fd, KDBUS_CMD_RECV, …) Granted, the code will look slightly different. However, the concept stays the same. You still transmit raw messages, no marshaling is mandated. You can transmit credentials and file-descriptors, you can specify the peer to send messages to and you got to do that all through a single file descriptor. Doesn’t sound complex, does it? So if kdbus is actually just like AF_UNIX+SOCK_DGRAM, why use it? ## kdbus The AF_UNIX setup described above has some significant flaws: • Messages are buffered in the receive-queue, which is relatively small. If you send a message to a peer with a full receive-queue, you will get EAGAIN. However, since your socket is not connected, you cannot wait for POLLOUT as it does not exist for unconnected sockets (which peer would you wait for?). Hence, you’re basically screwed if remote peers do not clear their queues. This is a really severe restriction which makes this model useless for such setups. • There is no policy regarding who can talk to whom. In the abstract namespace, everyone can talk to anyone in the same network namespace. This can be avoided by placing sockets in the file system. However, then you lose the auto-cleanup feature of the abstract namespace. Furthermore, you’re limited to file-system policies, which might not be suitable. • You can only have a single name per socket. If you implement multiple services that should be reached via different names, then you need multiple sockets (and thus you lose global message ordering). • You cannot send broadcasts. You cannot even easily enumerate peers. • You cannot get notifications if peers die. However, this is crucial if you provide services to them and need to clean them up after they disconnected. kdbus solves all these issues. Some of these issues could be solved with AF_UNIX (which, btw., would look a lot like AF_NETLINK), some cannot. But more importantly, AF_UNIX was never designed as a shared bus and hence should not be used as such. kdbus, on the other hand, was designed with a shared bus model in mind and as such avoids most of these issues. With that basic understanding of kdbus, next time a news source reports about the “crazy idea to shove DBus into the kernel”, I hope you’ll be able to judge for yourself. And if you want to know more, I recommend building the kernel documentation, or diving into the code.  June 18, 2015 With the new v221 release of systemd we are declaring the sd-bus API shipped with systemd stable. sd-bus is our minimal D-Bus IPC C library, supporting as back-ends both classic socket-based D-Bus and kdbus. The library has been been part of systemd for a while, but has only been used internally, since we wanted to have the liberty to still make API changes without affecting external consumers of the library. However, now we are confident to commit to a stable API for it, starting with v221. In this blog story I hope to provide you with a quick overview on sd-bus, a short reiteration on D-Bus and its concepts, as well as a few simple examples how to write D-Bus clients and services with it. # What is D-Bus again? Let's start with a quick reminder what D-Bus actually is: it's a powerful, generic IPC system for Linux and other operating systems. It knows concepts like buses, objects, interfaces, methods, signals, properties. It provides you with fine-grained access control, a rich type system, discoverability, introspection, monitoring, reliable multicasting, service activation, file descriptor passing, and more. There are bindings for numerous programming languages that are used on Linux. D-Bus has been a core component of Linux systems since more than 10 years. It is certainly the most widely established high-level local IPC system on Linux. Since systemd's inception it has been the IPC system it exposes its interfaces on. And even before systemd, it was the IPC system Upstart used to expose its interfaces. It is used by GNOME, by KDE and by a variety of system components. D-Bus refers to both a specification, and a reference implementation. The reference implementation provides both a bus server component, as well as a client library. While there are multiple other, popular reimplementations of the client library – for both C and other programming languages –, the only commonly used server side is the one from the reference implementation. (However, the kdbus project is working on providing an alternative to this server implementation as a kernel component.) D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However, the protocol may be used on top of TCP/IP as well. It does not natively support encryption, hence using D-Bus directly on TCP is usually not a good idea. It is possible to combine D-Bus with a transport like ssh in order to secure it. systemd uses this to make many of its APIs accessible remotely. A frequently asked question about D-Bus is why it exists at all, given that AF_UNIX sockets and FIFOs already exist on UNIX and have been used for a long time successfully. To answer this question let's make a comparison with popular web technology of today: what AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX sockets/FIFOs only shovel raw bytes between processes, D-Bus defines actual message encoding and adds concepts like method call transactions, an object system, security mechanisms, multicasting and more. From our 10year+ experience with D-Bus we know today that while there are some areas where we can improve things (and we are working on that, both with kdbus and sd-bus), it generally appears to be a very well designed system, that stood the test of time, aged well and is widely established. Today, if we'd sit down and design a completely new IPC system incorporating all the experience and knowledge we gained with D-Bus, I am sure the result would be very close to what D-Bus already is. Or in short: D-Bus is great. If you hack on a Linux project and need a local IPC, it should be your first choice. Not only because D-Bus is well designed, but also because there aren't many alternatives that can cover similar functionality. # Where does sd-bus fit in? Let's discuss why sd-bus exists, how it compares with the other existing C D-Bus libraries and why it might be a library to consider for your project. For C, there are two established, popular D-Bus libraries: libdbus, as it is shipped in the reference implementation of D-Bus, as well as GDBus, a component of GLib, the low-level tool library of GNOME. Of the two libdbus is the much older one, as it was written at the time the specification was put together. The library was written with a focus on being portable and to be useful as back-end for higher-level language bindings. Both of these goals required the API to be very generic, resulting in a relatively baroque, hard-to-use API that lacks the bits that make it easy and fun to use from C. It provides the building blocks, but few tools to actually make it straightforward to build a house from them. On the other hand, the library is suitable for most use-cases (for example, it is OOM-safe making it suitable for writing lowest level system software), and is portable to operating systems like Windows or more exotic UNIXes. GDBus is a much newer implementation. It has been written after considerable experience with using a GLib/GObject wrapper around libdbus. GDBus is implemented from scratch, shares no code with libdbus. Its design differs substantially from libdbus, it contains code generators to make it specifically easy to expose GObject objects on the bus, or talking to D-Bus objects as GObject objects. It translates D-Bus data types to GVariant, which is GLib's powerful data serialization format. If you are used to GLib-style programming then you'll feel right at home, hacking D-Bus services and clients with it is a lot simpler than using libdbus. With sd-bus we now provide a third implementation, sharing no code with either libdbus or GDBus. For us, the focus was on providing kind of a middle ground between libdbus and GDBus: a low-level C library that actually is fun to work with, that has enough syntactic sugar to make it easy to write clients and services with, but on the other hand is more low-level than GDBus/GLib/GObject/GVariant. To be able to use it in systemd's various system-level components it needed to be OOM-safe and minimal. Another major point we wanted to focus on was supporting a kdbus back-end right from the beginning, in addition to the socket transport of the original D-Bus specification ("dbus1"). In fact, we wanted to design the library closer to kdbus' semantics than to dbus1's, wherever they are different, but still cover both transports nicely. In contrast to libdbus or GDBus portability is not a priority for sd-bus, instead we try to make the best of the Linux platform and expose specific Linux concepts wherever that is beneficial. Finally, performance was also an issue (though a secondary one): neither libdbus nor GDBus will win any speed records. We wanted to improve on performance (throughput and latency) -- but simplicity and correctness are more important to us. We believe the result of our work delivers our goals quite nicely: the library is fun to use, supports kdbus and sockets as back-end, is relatively minimal, and the performance is substantially better than both libdbus and GDBus. To decide which of the three APIs to use for you C project, here are short guidelines: • If you hack on a GLib/GObject project, GDBus is definitely your first choice. • If portability to non-Linux kernels -- including Windows, Mac OS and other UNIXes -- is important to you, use either GDBus (which more or less means buying into GLib/GObject) or libdbus (which requires a lot of manual work). • Otherwise, sd-bus would be my recommended choice. (I am not covering C++ specifically here, this is all about plain C only. But do note: if you use Qt, then QtDBus is the D-Bus API of choice, being a wrapper around libdbus.) # Introduction to D-Bus Concepts To the uninitiated D-Bus usually appears to be a relatively opaque technology. It uses lots of concepts that appear unnecessarily complex and redundant on first sight. But actually, they make a lot of sense. Let's have a look: • A bus is where you look for IPC services. There are usually two kinds of buses: a system bus, of which there's exactly one per system, and which is where you'd look for system services; and a user bus, of which there's one per user, and which is where you'd look for user services, like the address book service or the mail program. (Originally, the user bus was actually a session bus -- so that you get multiple of them if you log in many times as the same user --, and on most setups it still is, but we are working on moving things to a true user bus, of which there is only one per user on a system, regardless how many times that user happens to log in.) • A service is a program that offers some IPC API on a bus. A service is identified by a name in reverse domain name notation. Thus, the org.freedesktop.NetworkManager service on the system bus is where NetworkManager's APIs are available and org.freedesktop.login1 on the system bus is where systemd-logind's APIs are exposed. • A client is a program that makes use of some IPC API on a bus. It talks to a service, monitors it and generally doesn't provide any services on its own. That said, lines are blurry and many services are also clients to other services. Frequently the term peer is used as a generalization to refer to either a service or a client. • An object path is an identifier for an object on a specific service. In a way this is comparable to a C pointer, since that's how you generally reference a C object, if you hack object-oriented programs in C. However, C pointers are just memory addresses, and passing memory addresses around to other processes would make little sense, since they of course refer to the address space of the service, the client couldn't make sense of it. Thus, the D-Bus designers came up with the object path concept, which is just a string that looks like a file system path. Example: /org/freedesktop/login1 is the object path of the 'manager' object of the org.freedesktop.login1 service (which, as we remember from above, is still the service systemd-logind exposes). Because object paths are structured like file system paths they can be neatly arranged in a tree, so that you end up with a venerable tree of objects. For example, you'll find all user sessions systemd-logind manages below the /org/freedesktop/login1/session sub-tree, for example called /org/freedesktop/login1/session/_7, /org/freedesktop/login1/session/_55 and so on. How services precisely label their objects and arrange them in a tree is completely up to the developers of the services. • Each object that is identified by an object path has one or more interfaces. An interface is a collection of signals, methods, and properties (collectively called members), that belong together. The concept of a D-Bus interface is actually pretty much identical to what you know from programming languages such as Java, which also know an interface concept. Which interfaces an object implements are up the developers of the service. Interface names are in reverse domain name notation, much like service names. (Yes, that's admittedly confusing, in particular since it's pretty common for simpler services to reuse the service name string also as an interface name.) A couple of interfaces are standardized though and you'll find them available on many of the objects offered by the various services. Specifically, those are org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer and org.freedesktop.DBus.Properties. • An interface can contain methods. The word "method" is more or less just a fancy word for "function", and is a term used pretty much the same way in object-oriented languages such as Java. The most common interaction between D-Bus peers is that one peer invokes one of these methods on another peer and gets a reply. A D-Bus method takes a couple of parameters, and returns others. The parameters are transmitted in a type-safe way, and the type information is included in the introspection data you can query from each object. Usually, method names (and the other member types) follow a CamelCase syntax. For example, systemd-logind exposes an ActivateSession method on the org.freedesktop.login1.Manager interface that is available on the /org/freedesktop/login1 object of the org.freedesktop.login1 service. • A signature describes a set of parameters a function (or signal, property, see below) takes or returns. It's a series of characters that each encode one parameter by its type. The set of types available is pretty powerful. For example, there are simpler types like s for string, or u for 32bit integer, but also complex types such as as for an array of strings or a(sb) for an array of structures consisting of one string and one boolean each. See the D-Bus specification for the full explanation of the type system. The ActivateSession method mentioned above takes a single string as parameter (the parameter signature is hence s), and returns nothing (the return signature is hence the empty string). Of course, the signature can get a lot more complex, see below for more examples. • A signal is another member type that the D-Bus object system knows. Much like a method it has a signature. However, they serve different purposes. While in a method call a single client issues a request on a single service, and that service sends back a response to the client, signals are for general notification of peers. Services send them out when they want to tell one or more peers on the bus that something happened or changed. In contrast to method calls and their replies they are hence usually broadcast over a bus. While method calls/replies are used for duplex one-to-one communication, signals are usually used for simplex one-to-many communication (note however that that's not a requirement, they can also be used one-to-one). Example: systemd-logind broadcasts a SessionNew signal from its manager object each time a user logs in, and a SessionRemoved signal every time a user logs out. • A property is the third member type that the D-Bus object system knows. It's similar to the property concept known by languages like C#. Properties also have a signature, and are more or less just variables that an object exposes, that can be read or altered by clients. Example: systemd-logind exposes a property Docked of the signature b (a boolean). It reflects whether systemd-logind thinks the system is currently in a docking station of some form (only applies to laptops …). So much for the various concepts D-Bus knows. Of course, all these new concepts might be overwhelming. Let's look at them from a different perspective. I assume many of the readers have an understanding of today's web technology, specifically HTTP and REST. Let's try to compare the concept of a HTTP request with the concept of a D-Bus method call: • A HTTP request you issue on a specific network. It could be the Internet, or it could be your local LAN, or a company VPN. Depending on which network you issue the request on, you'll be able to talk to a different set of servers. This is not unlike the "bus" concept of D-Bus. • On the network you then pick a specific HTTP server to talk to. That's roughly comparable to picking a service on a specific bus. • On the HTTP server you then ask for a specific URL. The "path" part of the URL (by which I mean everything after the host name of the server, up to the last "/") is pretty similar to a D-Bus object path. • The "file" part of the URL (by which I mean everything after the last slash, following the path, as described above), then defines the actual call to make. In D-Bus this could be mapped to an interface and method name. • Finally, the parameters of a HTTP call follow the path after the "?", they map to the signature of the D-Bus call. Of course, comparing an HTTP request to a D-Bus method call is a bit comparing apples and oranges. However, I think it's still useful to get a bit of a feeling of what maps to what. # From the shell So much about the concepts and the gray theory behind them. Let's make this exciting, let's actually see how this feels on a real system. Since a while systemd has included a tool busctl that is useful to explore and interact with the D-Bus object system. When invoked without parameters, it will show you a list of all peers connected to the system bus. (Use --user to see the peers of your user bus instead): $ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
[…]
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -
[…]


(I have shortened the output a bit, to make keep things brief).

The list begins with a list of all peers currently connected to the bus. They are identified by peer names like ":1.11". These are called unique names in D-Bus nomenclature. Basically, every peer has a unique name, and they are assigned automatically when a peer connects to the bus. They are much like an IP address if you so will. You'll notice that a couple of peers are already connected, including our little busctl tool itself as well as a number of system services. The list then shows all actual services on the bus, identified by their service names (as discussed above; to discern them from the unique names these are also called well-known names). In many ways well-known names are similar to DNS host names, i.e. they are a friendlier way to reference a peer, but on the lower level they just map to an IP address, or in this comparison the unique name. Much like you can connect to a host on the Internet by either its host name or its IP address, you can also connect to a bus peer either by its unique or its well-known name. (Note that each peer can have as many well-known names as it likes, much like an IP address can have multiple host names referring to it).

OK, that's already kinda cool. Try it for yourself, on your local machine (all you need is a recent, systemd-based distribution).

Let's now go the next step. Let's see which objects the org.freedesktop.login1 service actually offers:

$busctl tree org.freedesktop.login1 └─/org/freedesktop/login1 ├─/org/freedesktop/login1/seat │ ├─/org/freedesktop/login1/seat/seat0 │ └─/org/freedesktop/login1/seat/self ├─/org/freedesktop/login1/session │ ├─/org/freedesktop/login1/session/_31 │ └─/org/freedesktop/login1/session/self └─/org/freedesktop/login1/user ├─/org/freedesktop/login1/user/_1000 └─/org/freedesktop/login1/user/self  Pretty, isn't it? What's actually even nicer, and which the output does not show is that there's full command line completion available: as you press TAB the shell will auto-complete the service names for you. It's a real pleasure to explore your D-Bus objects that way! The output shows some objects that you might recognize from the explanations above. Now, let's go further. Let's see what interfaces, methods, signals and properties one of these objects actually exposes: $ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -


As before, the busctl command supports command line completion, hence both the service name and the object path used are easily put together on the shell simply by pressing TAB. The output shows the methods, properties, signals of one of the session objects that are currently made available by systemd-logind. There's a section for each interface the object knows. The second column tells you what kind of member is shown in the line. The third column shows the signature of the member. In case of method calls that's the input parameters, the fourth column shows what is returned. For properties, the fourth column encodes the current value of them.

So far, we just explored. Let's take the next step now: let's become active - let's call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock


I don't think I need to mention this anymore, but anyway: again there's full command line completion available. The third argument is the interface name, the fourth the method name, both can be easily completed by pressing TAB. In this case we picked the Lock method, which activates the screen lock for the specific session. And yupp, the instant I pressed enter on this line my screen lock turned on (this only works on DEs that correctly hook into systemd-logind for this to work. GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no parameters and returns none. Of course, it can get more complicated for some calls. Here's another example, this time using one of systemd's own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"


This call takes two strings as input parameters, as we denote in the signature string that follows the method name (as usual, command line completion helps you getting this right). Following the signature the next two parameters are simply the two strings to pass. The specified signature string hence indicates what comes next. systemd's StartUnit method call takes the unit name to start as first parameter, and the mode in which to start it as second. The call returned a single object path value. It is encoded the same way as the input parameter: a signature (just o for the object path) followed by the actual value.

Of course, some method call parameters can get a ton more complex, but with busctl it's relatively easy to encode them all. See the man page for details.

busctl knows a number of other operations. For example, you can use it to monitor D-Bus traffic as it happens (including generating a .cap file for use with Wireshark!) or you can set or get specific properties. However, this blog story was supposed to be about sd-bus, not busctl, hence let's cut this short here, and let me direct you to the man page in case you want to know more about the tool.

busctl (like the rest of system) is implemented using the sd-bus API. Thus it exposes many of the features of sd-bus itself. For example, you can use to connect to remote or container buses. It understands both kdbus and classic D-Bus, and more!

# sd-bus

But enough! Let's get back on topic, let's talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file sd-bus.h.

Here's a random selection of features of the library, that make it compare well with the other implementations available.

• Supports both kdbus and dbus1 as back-end.

• Has high-level support for connecting to remote buses via ssh, and to buses of local OS containers.

• Powerful credential model, to implement authentication of clients in services. Currently 34 individual fields are supported, from the PID of the client to the cgroup or capability sets.

• Support for tracking the life-cycle of peers in order to release local objects automatically when all peers referencing them disconnected.

• The client builds an efficient decision tree to determine which handlers to deliver an incoming bus message to.

• Automatically translates D-Bus errors into UNIX style errors and back (this is lossy though), to ensure best integration of D-Bus into low-level Linux programs.

• Powerful but lightweight object model for exposing local objects on the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on completing the set of manual pages. For details see all pages starting with sd_bus_.

# Invoking a Method, from C, with sd-bus

So much about the library in general. Here's an example for connecting to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
sd_bus_error error = SD_BUS_ERROR_NULL;
sd_bus_message *m = NULL;
sd_bus *bus = NULL;
const char *path;
int r;

/* Connect to the system bus */
r = sd_bus_open_system(&bus);
if (r < 0) {
fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
goto finish;
}

/* Issue the method call and store the respons message in m */
r = sd_bus_call_method(bus,
"org.freedesktop.systemd1",           /* service to contact */
"/org/freedesktop/systemd1",          /* object path */
"org.freedesktop.systemd1.Manager",   /* interface name */
"StartUnit",                          /* method name */
&error,                               /* object to return error in */
&m,                                   /* return message on success */
"ss",                                 /* input signature */
"cups.service",                       /* first argument */
"replace");                           /* second argument */
if (r < 0) {
fprintf(stderr, "Failed to issue method call: %s\n", error.message);
goto finish;
}

/* Parse the response message */
if (r < 0) {
fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
goto finish;
}

printf("Queued service job as %s.\n", path);

finish:
sd_bus_error_free(&error);
sd_bus_message_unref(m);
sd_bus_unref(bus);

return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}


Save this example as bus-client.c, then build it with:

$gcc bus-client.c -o bus-client pkg-config --cflags --libs libsystemd  This will generate a binary bus-client you can now run. Make sure to run it as root though, since access to the StartUnit method is privileged: # ./bus-client Queued service job as /org/freedesktop/systemd1/job/3586.  And that's it already, our first example. It showed how we invoked a method call on the bus. The actual function call of the method is very close to the busctl command line we used before. I hope the code excerpt needs little further explanation. It's supposed to give you a taste how to write D-Bus clients with sd-bus. For more more information please have a look at the header file, the man page or even the sd-bus sources. # Implementing a Service, in C, with sd-bus Of course, just calling a single method is a rather simplistic example. Let's have a look on how to write a bus service. We'll write a small calculator service, that exposes a single object, which implements an interface that exposes two methods: one to multiply two 64bit signed integers, and one to divide one 64bit signed integer by another. #include <stdio.h> #include <stdlib.h> #include <errno.h> #include <systemd/sd-bus.h> static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) { int64_t x, y; int r; /* Read the parameters */ r = sd_bus_message_read(m, "xx", &x, &y); if (r < 0) { fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r)); return r; } /* Reply with the response */ return sd_bus_reply_method_return(m, "x", x * y); } static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) { int64_t x, y; int r; /* Read the parameters */ r = sd_bus_message_read(m, "xx", &x, &y); if (r < 0) { fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r)); return r; } /* Return an error on division by zero */ if (y == 0) { sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero."); return -EINVAL; } return sd_bus_reply_method_return(m, "x", x / y); } /* The vtable of our little object, implements the net.poettering.Calculator interface */ static const sd_bus_vtable calculator_vtable[] = { SD_BUS_VTABLE_START(0), SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED), SD_BUS_METHOD("Divide", "xx", "x", method_divide, SD_BUS_VTABLE_UNPRIVILEGED), SD_BUS_VTABLE_END }; int main(int argc, char *argv[]) { sd_bus_slot *slot = NULL; sd_bus *bus = NULL; int r; /* Connect to the user bus this time */ r = sd_bus_open_user(&bus); if (r < 0) { fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r)); goto finish; } /* Install the object */ r = sd_bus_add_object_vtable(bus, &slot, "/net/poettering/Calculator", /* object path */ "net.poettering.Calculator", /* interface name */ calculator_vtable, NULL); if (r < 0) { fprintf(stderr, "Failed to issue method call: %s\n", strerror(-r)); goto finish; } /* Take a well-known service name so that clients can find us */ r = sd_bus_request_name(bus, "net.poettering.Calculator", 0); if (r < 0) { fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r)); goto finish; } for (;;) { /* Process requests */ r = sd_bus_process(bus, NULL); if (r < 0) { fprintf(stderr, "Failed to process bus: %s\n", strerror(-r)); goto finish; } if (r > 0) /* we processed a request, try to process another one, right-away */ continue; /* Wait for the next request to process */ r = sd_bus_wait(bus, (uint64_t) -1); if (r < 0) { fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r)); goto finish; } } finish: sd_bus_slot_unref(slot); sd_bus_unref(bus); return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS; }  Save this example as bus-service.c, then build it with: $ gcc bus-service.c -o bus-service pkg-config --cflags --libs libsystemd


Now, let's run it:

$./bus-service  In another terminal, let's try to talk to it. Note that this service is now on the user bus, not on the system bus as before. We do this for simplicity reasons: on the system bus access to services is tightly controlled so unprivileged clients cannot request privileged operations. On the user bus however things are simpler: as only processes of the user owning the bus can connect no further policy enforcement will complicate this example. Because the service is on the user bus, we have to pass the --user switch on the busctl command line. Let's start with looking at the service's object tree. $ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator


As we can see, there's only a single object on the service, which is not surprising, given that our code above only registered one. Let's see the interfaces and the members this object exposes:

$busctl --user introspect net.poettering.Calculator /net/poettering/Calculator NAME TYPE SIGNATURE RESULT/VALUE FLAGS net.poettering.Calculator interface - - - .Divide method xx x - .Multiply method xx x - org.freedesktop.DBus.Introspectable interface - - - .Introspect method - s - org.freedesktop.DBus.Peer interface - - - .GetMachineId method - s - .Ping method - - - org.freedesktop.DBus.Properties interface - - - .Get method ss v - .GetAll method s a{sv} - .Set method ssv - - .PropertiesChanged signal sa{sv}as - -  The sd-bus library automatically added a couple of generic interfaces, as mentioned above. But the first interface we see is actually the one we added! It shows our two methods, and both take "xx" (two 64bit signed integers) as input parameters, and return one "x". Great! But does it work? $ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35


Woohoo! We passed the two integers 5 and 7, and the service actually multiplied them for us and returned a single integer 35! Let's try the other method:

$busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17 x 5  Oh, wow! It can even do integer division! Fantastic! But let's trick it into dividing by zero: $ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.


Nice! It detected this nicely and returned a clean error about it. If you look in the source code example above you'll see how precisely we generated the error.

And that's really all I have for today. Of course, the examples I showed are short, and I don't get into detail here on what precisely each line does. However, this is supposed to be a short introduction into D-Bus and sd-bus, and it's already way too long for that …

I hope this blog story was useful to you. If you are interested in using sd-bus for your own programs, I hope this gets you started. If you have further questions, check the (incomplete) man pages, and inquire us on IRC or the systemd mailing list. If you need more examples, have a look at the systemd source tree, all of systemd's many bus services use sd-bus extensively.

 June 16, 2015

Recently, I've been fighting with the never ending issue of timezones. I never thought I would have plunged into this rabbit hole, but hacking on OpenStack and Gnocchi I felt into that trap easily is, thanks to Python.

## “Why you really, really, should never ever deal with timezones”

To get a glimpse of the complexity of timezones, I recommend that you watch Tom Scott's video on the subject. It's fun and it summarizes remarkably well the nightmare that timezones are and why you should stop thinking that you're smart.

## The importance of timezones in applications

Once you've heard what Tom says, I think it gets pretty clear that a timestamp without any timezone attached does not give any useful information. It should be considered irrelevant and useless. Without the necessary context given by the timezone, you cannot infer what point in time your application is really referring to.

That means your application should never handle timestamps with no timezone information. It should try to guess or raises an error if no timezone is provided in any input.

Of course, you can infer that having no timezone information means UTC. This sounds very handy, but can also be dangerous in certain applications or language – such as Python, as we'll see.

Indeed, in certain applications, converting timestamps to UTC and losing the timezone information is a terrible idea. Imagine that a user create a recurring event every Wednesday at 10:00 in its local timezone, say CET. If you convert that to UTC, the event will end up being stored as every Wednesday at 09:00.

Now imagine that the CET timezone switches from UTC+01:00 to UTC+02:00: your application will compute that the event starts at 11:00 CET every Wednesday. Which is wrong, because as the user told you, the event starts at 10:00 CET, whatever the definition of CET is. Not at 11:00 CET. So CET means CET, not necessarily UTC+1.

As for endpoints like REST API, a thing I daily deal with, all timestamps should include a timezone information. It's nearly impossible to know what timezone the timestamps are in otherwise: UTC? Server local? User local? No way to know.

## Python design & defect

Python comes with a timestamp object named datetime.datetime. It can store date and time precise to the microsecond, and is qualified of timezone "aware" or "unaware", whether it embeds a timezone information or not.

To build such an object based on the current time, one can use datetime.datetime.utcnow() to retrieve the date and time for the UTC timezone, and datetime.datetime.now() to retrieve the date and time for the current timezone, whatever it is.

>>> import datetime>>> datetime.datetime.utcnow()datetime.datetime(2015, 6, 15, 13, 24, 48, 27631)>>> datetime.datetime.now()datetime.datetime(2015, 6, 15, 15, 24, 52, 276161)

As you can notice, none of these results contains timezone information. Indeed, Python datetime API always returns unaware datetime objects, which is very unfortunate. Indeed, as soon as you get one of this object, there is no way to know what the timezone is, therefore these objects are pretty "useless" on their own.

Armin Ronacher proposes that an application always consider that the unaware datetime objects from Python are considered as UTC. As we just saw, that statement cannot be considered true for objects returned by datetime.datetime.now(), so I would not advise doing so. datetime objects with no timezone should be considered as a "bug" in the application.

## Recommendations

My recommendation list comes down to:

1. Always use aware datetime object, i.e. with timezone information. That makes sure you can compare them directly (aware and unaware datetime objects are not comparable) and will return them correctly to users. Leverage pytz to have timezone objects.
2. Use ISO 8601 as input and output string format. Use datetime.datetime.isoformat() to return timestamps as string formatted using that format, which includes the timezone information.

In Python, that's equivalent to having:

>>> import datetime>>> import pytz>>> def utcnow():    return datetime.datetime.now(tz=pytz.utc)>>> utcnow()datetime.datetime(2015, 6, 15, 14, 45, 19, 182703, tzinfo=<UTC>)>>> utcnow().isoformat()'2015-06-15T14:45:21.982600+00:00'

If you need to parse strings containing ISO 8601 formatted timestamp, you can rely on the iso8601, which returns timestamps with correct timezone information. This makes timestamps directly comparable:

>>> import iso8601>>> iso8601.parse_date(utcnow().isoformat())datetime.datetime(2015, 6, 15, 14, 46, 43, 945813, tzinfo=<FixedOffset '+00:00' datetime.timedelta(0)>)>>> iso8601.parse_date(utcnow().isoformat()) < utcnow()True

If you need to store those timestamps, the same rule should apply. If you rely on MongoDB, it assumes that all the timestamp are in UTC, so be careful when storing them – you will have to normalize the timestamp to UTC.

For MySQL, nothing is assumed, it's up to the application to insert them in a timezone that makes sense to it. Obviously, if you have multiple applications accessing the same database with different data sources, this can end up being a nightmare.

PostgreSQL has a special data type that is recommended called timestamp with timezone, and which can store the timezone associated, and do all the computation for you. That's the recommended way to store them obviously. That does not mean you should not use UTC in most cases; that just means you are sure that the timestamp are stored in UTC since it's written in the database, and you check if any other application inserted timestamps with different timezone.

## OpenStack status

As a side note, I've improved OpenStack situation recently by changing the oslo.utils.timeutils module to deprecate some useless and dangerous functions. I've also added support for returning timezone aware objects when using the oslo_utils.timeutils.utcnow() function. It's not possible to make it a default unfortunately for backward compatibility reason, but it's there nevertheless, and it's advised to use it. Thanks to my colleague Victor for the help!

Have a nice day, whatever your timezone is!

Hello,

This weekend my work on implementing NVIDIA global performance counters has been merged in Nouveau.

With Linux 4.2, Nouveau will allow the userspace to monitor both compute and graphics (global) counters for Tesla, but only compute counters for Fermi. I need to go back to Windows for reverse engineering graphics counters with NVIDIA Perfkit. About Kepler, I have to figure out how to deal with clock gating but this is not going to be hard, so I’ll probably submit a series which adds compute counters this month.

All of these performance counters will be exposed through the Gallium’s HUD and GL_AMD_performance_monitor once I have finished writing the code in mesa.

But don’t be too excited for the moment, because we still need to implement the new nvif interface exposed by Nouveau in libdrm.

My plan is to complete all of this work before the XDC 2015.

Thanks!

 June 12, 2015

Last post and the one before were about how to create your own piglit tests. Previously, I have written an introduction to piglit and how to launch a tailored piglit run (more details about these last two topics in my FOSDEM 2015 talk).

Now it’s time to talk about how to contribute to piglit.

# How to contribute to piglit

Once you want to contribute something to piglit, you need to generate the patches and send them for review to the mailing list. They are usually created by git format-patch and sent by git send-email command (if you need help with git, there are a lot of tutorials). Remember to rebase your branch against up-to-dated master before creating the patches, so no merge conflicts will appear if the reviewers wants to apply them locally.

Whether you have some patches ready to be submitted or you have questions about piglit, subscribe to piglit@lists.freedesktop.org and send them there.

Most piglit developers work in other areas (such as OpenGL driver development!) which means that the review process of piglit patches could take some time, so be patient and wait.

If after some time (something like one or two weeks) there is no answer about your patches, you can send a reminder saying that a review is pending. If you have commit rights and the patch is trivial (or you are very confident that it is right), you can even push it to the repository’s master branch after that time. Piglit is not as strict as other projects in this regard, however do not abuse of this rule.

Once you have a good track of contributions or other contributors told you to do so, you can ask for piglit repository’s commit rights by following these instructions. And don’t hesitate to review patches from others!

This is the list of my piglit related posts:

Plus my FOSDEM 2015 talk.

Thanks for following this short introduction to piglit. Happy hacking!

 June 11, 2015

Last post I talked about how to develop GLSL shader tests and how to add them to piglit. This is a nice way to develop simple tests but sometimes you need to do something more complex. For that case, piglit can run binary test programs.

# Introduction

Binary test programs in piglit are written in C language taking advantage of the piglit framework to facilitate the test development (there is no main() function, no need to setup GLX (or WGL or EGL or whatever), no need to check manually the available extensions or call to glXSwapBuffers() or similar functions…), however you can still use all OpenGL C-language API.

## Simple example

Piglit framework is mostly undocumented but easy to understand once you start reading some existing tests. So I will start with a simple example explaining how a typical binary test looks like and then show you a real example.

/*
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/

/** @file test-example.c
*
* This test shows an skeleton for creating new tests.
*/

#include "piglit-util-gl.h"

PIGLIT_GL_TEST_CONFIG_BEGIN
config.window_width = 100;
config.window_height = 100;
config.supports_gl_compat_version = 10;
config.supports_gl_core_version = 31;
config.window_visual = PIGLIT_GL_VISUAL_DOUBLE | PIGLIT_GL_VISUAL_RGBA;

PIGLIT_GL_TEST_CONFIG_END

static const char vs_pass_thru_text[] =
"#version 330n"
"n"
"in vec4 piglit_vertex;n"
"n"
"void main() {n"
"   gl_Position = piglit_vertex;n"
"}n";

static const char fs_source[] =
"#version 330n"
"n"
"void main() {n"
"       color = vec4(1.0, 0.0. 0.0, 1.0);n"
"}n";

GLuint prog;

void
piglit_init(int argc, char **argv)
{
bool pass = true;

prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);

glUseProgram(prog);

glClearColor(0, 0, 0, 0);

/* <-- OpenGL commands to be done --> */

glViewport(0, 0, piglit_width, piglit_height);

/* piglit_draw_* commands can go into piglit_display() too */
piglit_draw_rect(-1, -1, 2, 2);

if (!piglit_check_gl_error(GL_NO_ERROR))
pass = false;

piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);
}

enum piglit_result piglit_display(void)
{
/* <-- OpenGL drawing commands, if needed --> */

/* UNREACHED */
return PIGLIT_FAIL;
}

As you see in this example, there are four different parts:

• The license details should be included in each source file. There is one agreed by most contributors and it’s a MIT license assigning the copyright to Intel Corporation. More information in COPYING file.
• It includes a brief description of what the test does.

/*
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/

/** @file test-example.c
*
* This test shows an skeleton for creating new tests.
*/

2. Piglit setup. This is needed to check if a test can be executed by a given driver (minimum supported GL version), or to create a window of a specific size, or even to define if we want double buffering.

PIGLIT_GL_TEST_CONFIG_BEGIN
config.window_width = 100;
config.window_height = 100;
config.supports_gl_compat_version = 10;
config.supports_gl_core_version = 31;
config.window_visual = PIGLIT_GL_VISUAL_DOUBLE | PIGLIT_GL_VISUAL_RGBA;

PIGLIT_GL_TEST_CONFIG_END

3. piglit_init(). This is the function that it is going to be called for configuring the test itself. Some tests implement all their code inside piglit_init() because it doesn’t need to draw anything (or it doesn’t need to update the drawing frame by frame). In any case, you usually put here the following code:
• Check for needed extensions.
• Check for limits or maximum values of different variables like GL_MAX_SHADER_STORAGE_BLOCKS, GL_UNIFORM_BLOCK_SIZE, etc.
• Setup constant data, upload it.
• All the initialization setup you need for drawing commands: compile and link shaders, set clear color, etc.

void
piglit_init(int argc, char **argv)
{
bool pass = true;

prog = piglit_build_simple_program(vs_pass_thru_text, fs_source);

glUseProgram(prog);

glClearColor(0, 0, 0, 0);

/* <-- OpenGL commands to be done --> */

glViewport(0, 0, piglit_width, piglit_height);

/* piglit_draw_* commands can go into piglit_display() too */
piglit_draw_rect(-1, -1, 2, 2);

if (!piglit_check_gl_error(GL_NO_ERROR))
pass = false;

piglit_report_result(pass ? PIGLIT_PASS : PIGLIT_FAIL);
}

4. piglit_display(). This is the function that it is going to be executed periodically to update each frame of the rendered window. In some tests, you will find it almost empty (it returns PIGLIT_FAIL) because it is not needed by the test program.

enum piglit_result piglit_display(void)
{
/* <-- OpenGL drawing commands, if needed --> */

/* UNREACHED */
return PIGLIT_FAIL;
}

Notice that you are free to add any helper functions you need like any other C program but the aforementioned parts are required by piglit.

# Piglit API

Piglit provides a lot of functions under its API to be used by the test program. They are usually often-used functions that substitute one or several OpenGL function calls and other code that accompany them.

The available functions are listed in piglit-util.gl.h file, which must be included in every binary test source code.

/*
* Copyright (c) The Piglit project 2007
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* on the rights to use, copy, modify, merge, publish, distribute, sub
* license, and/or sell copies of the Software, and to permit persons to whom
* the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.  IN NO EVENT SHALL
* VA LINUX SYSTEM, IBM AND/OR THEIR SUPPLIERS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
* USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#pragma once
#ifndef __PIGLIT_UTIL_GL_H__
#define __PIGLIT_UTIL_GL_H__

#ifdef __cplusplus
extern "C" {
#endif

#include "piglit-util.h"

#include <piglit/gl_wrap.h>
#include <piglit/glut_wrap.h>

#include "piglit-framework-gl.h"

extern const uint8_t fdo_bitmap[];
extern const unsigned int fdo_bitmap_width;
extern const unsigned int fdo_bitmap_height;

extern bool piglit_is_core_profile;

/**
* Determine if the API is OpenGL ES.
*/
bool piglit_is_gles(void);

/**
* \brief Get version of OpenGL or OpenGL ES API.
*
* Returned version is multiplied by 10 to make it an integer.  So for
* example, if the GL version is 2.1, the return value is 21.
*/
int piglit_get_gl_version(void);

/**
* \precondition name is not null
*/
bool piglit_is_extension_supported(const char *name);

/**
* reinitialize the supported extension List.
*/
void piglit_gl_reinitialize_extensions();

/**
* \brief Convert a GL error to a string.
*
* For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
*
* Return "(unrecognized error)" if the enum is not recognized.
*/
const char* piglit_get_gl_error_name(GLenum error);

/**
* \brief Convert a GL enum to a string.
*
* For example, given GL_INVALID_ENUM, return "GL_INVALID_ENUM".
*
* Return "(unrecognized enum)" if the enum is not recognized.
*/
const char *piglit_get_gl_enum_name(GLenum param);

/**
* \brief Convert a string to a GL enum.
*
* For example, given "GL_INVALID_ENUM", return GL_INVALID_ENUM.
*
* abort() if the string is not recognized.
*/
GLenum piglit_get_gl_enum_from_name(const char *name);

/**
* \brief Convert a GL primitive type enum value to a string.
*
* For example, given GL_POLYGON, return "GL_POLYGON".
* We don't use piglit_get_gl_enum_name() for this because there are
* other enums which alias the prim type enums (ex: GL_POINTS = GL_NONE);
*
* Return "(unrecognized enum)" if the enum is not recognized.
*/
const char *piglit_get_prim_name(GLenum prim);

/**
* \brief Check for unexpected GL errors.
*
* If glGetError() returns an error other than \c expected_error, then
* print a diagnostic and return GL_FALSE.  Otherwise return GL_TRUE.
*/
GLboolean
piglit_check_gl_error_(GLenum expected_error, const char *file, unsigned line);

#define piglit_check_gl_error(expected) \
piglit_check_gl_error_((expected), __FILE__, __LINE__)

/**
* \brief Drain all GL errors.
*
* Repeatly call glGetError and discard errors until it returns GL_NO_ERROR.
*/
void piglit_reset_gl_error(void);

void piglit_require_gl_version(int required_version_times_10);
void piglit_require_extension(const char *name);
void piglit_require_not_extension(const char *name);
unsigned piglit_num_components(GLenum base_format);
bool piglit_get_luminance_intensity_bits(GLenum internalformat, int *bits);
int piglit_probe_pixel_rgb_silent(int x, int y, const float* expected, float *out_probe);
int piglit_probe_pixel_rgba_silent(int x, int y, const float* expected, float *out_probe);
int piglit_probe_pixel_rgb(int x, int y, const float* expected);
int piglit_probe_pixel_rgba(int x, int y, const float* expected);
int piglit_probe_rect_r_ubyte(int x, int y, int w, int h, GLubyte expected);
int piglit_probe_rect_rgb(int x, int y, int w, int h, const float* expected);
int piglit_probe_rect_rgb_silent(int x, int y, int w, int h, const float *expected);
int piglit_probe_rect_rgba(int x, int y, int w, int h, const float* expected);
int piglit_probe_rect_rgba_int(int x, int y, int w, int h, const int* expected);
int piglit_probe_rect_rgba_uint(int x, int y, int w, int h, const unsigned int* expected);
void piglit_compute_probe_tolerance(GLenum format, float *tolerance);
int piglit_compare_images_color(int x, int y, int w, int h, int num_components,
const float *tolerance,
const float *expected_image,
const float *observed_image);
int piglit_probe_image_color(int x, int y, int w, int h, GLenum format, const float *image);
int piglit_probe_image_rgb(int x, int y, int w, int h, const float *image);
int piglit_probe_image_rgba(int x, int y, int w, int h, const float *image);
int piglit_compare_images_ubyte(int x, int y, int w, int h,
const GLubyte *expected_image,
const GLubyte *observed_image);
int piglit_probe_image_stencil(int x, int y, int w, int h, const GLubyte *image);
int piglit_probe_image_ubyte(int x, int y, int w, int h, GLenum format,
const GLubyte *image);
int piglit_probe_texel_rect_rgb(int target, int level, int x, int y,
int w, int h, const float *expected);
int piglit_probe_texel_rgb(int target, int level, int x, int y,
const float* expected);
int piglit_probe_texel_rect_rgba(int target, int level, int x, int y,
int w, int h, const float *expected);
int piglit_probe_texel_rgba(int target, int level, int x, int y,
const float* expected);
int piglit_probe_texel_volume_rgba(int target, int level, int x, int y, int z,
int w, int h, int d, const float *expected);
int piglit_probe_pixel_depth(int x, int y, float expected);
int piglit_probe_rect_depth(int x, int y, int w, int h, float expected);
int piglit_probe_pixel_stencil(int x, int y, unsigned expected);
int piglit_probe_rect_stencil(int x, int y, int w, int h, unsigned expected);
int piglit_probe_rect_halves_equal_rgba(int x, int y, int w, int h);

bool piglit_probe_buffer(GLuint buf, GLenum target, const char *label,
unsigned n, unsigned num_components,
const float *expected);

int piglit_use_fragment_program(void);
int piglit_use_vertex_program(void);
void piglit_require_fragment_program(void);
void piglit_require_vertex_program(void);
GLuint piglit_compile_program(GLenum target, const char* text);
GLvoid piglit_draw_triangle(float x1, float y1, float x2, float y2,
float x3, float y3);
GLvoid piglit_draw_triangle_z(float z, float x1, float y1, float x2, float y2,
float x3, float y3);
GLvoid piglit_draw_rect_custom(float x, float y, float w, float h,
bool use_patches);
GLvoid piglit_draw_rect(float x, float y, float w, float h);
GLvoid piglit_draw_rect_z(float z, float x, float y, float w, float h);
GLvoid piglit_draw_rect_tex(float x, float y, float w, float h,
float tx, float ty, float tw, float th);
GLvoid piglit_draw_rect_back(float x, float y, float w, float h);
void piglit_draw_rect_from_arrays(const void *verts, const void *tex,
bool use_patches);

unsigned short piglit_half_from_float(float val);

void piglit_escape_exit_key(unsigned char key, int x, int y);

void piglit_gen_ortho_projection(double left, double right, double bottom,
double top, double near_val, double far_val,
GLboolean push);
void piglit_ortho_projection(int w, int h, GLboolean push);
void piglit_frustum_projection(GLboolean push, double l, double r, double b,
double t, double n, double f);
void piglit_gen_ortho_uniform(GLint location, double left, double right,
double bottom, double top, double near_val,
double far_val);
void piglit_ortho_uniform(GLint location, int w, int h);

GLuint piglit_checkerboard_texture(GLuint tex, unsigned level,
unsigned width, unsigned height,
unsigned horiz_square_size, unsigned vert_square_size,
const float *black, const float *white);
GLuint piglit_miptree_texture(void);
GLfloat *piglit_rgbw_image(GLenum internalFormat, int w, int h,
GLboolean alpha, GLenum basetype);
GLubyte *piglit_rgbw_image_ubyte(int w, int h, GLboolean alpha);
GLuint piglit_rgbw_texture(GLenum internalFormat, int w, int h, GLboolean mip,
GLboolean alpha, GLenum basetype);
GLuint piglit_depth_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
GLuint piglit_array_texture(GLenum target, GLenum format, int w, int h, int d, GLboolean mip);
GLuint piglit_multisample_texture(GLenum target, GLenum tex,
GLenum internalFormat,
unsigned width, unsigned height,
unsigned depth, unsigned samples,
GLenum format, GLenum type, void *data);
extern float piglit_tolerance[4];
void piglit_set_tolerance_for_bits(int rbits, int gbits, int bbits, int abits);
extern void piglit_require_transform_feedback(void);

bool
piglit_get_compressed_block_size(GLenum format,
unsigned *bw, unsigned *bh, unsigned *bytes);

unsigned
piglit_compressed_image_size(GLenum format, unsigned width, unsigned height);

unsigned
piglit_compressed_pixel_offset(GLenum format, unsigned width,
unsigned x, unsigned y);

void
piglit_visualize_image(float *img, GLenum base_internal_format,
int image_width, int image_height,
int image_count, bool rhs);

float piglit_srgb_to_linear(float x);
float piglit_linear_to_srgb(float x);

extern GLfloat cube_face_texcoords[6][4][3];
extern const char *cube_face_names[6];
extern const GLenum cube_face_targets[6];

/**
* Common vertex program code to perform a model-view-project matrix transform
*/
#define PIGLIT_VERTEX_PROGRAM_MVP_TRANSFORM        \
"ATTRIB iPos = vertex.position;\n"      \
"OUTPUT oPos = result.position;\n"      \
"PARAM  mvp[4] = { state.matrix.mvp };\n"   \
"DP4    oPos.x, mvp[0], iPos;\n"        \
"DP4    oPos.y, mvp[1], iPos;\n"        \
"DP4    oPos.z, mvp[2], iPos;\n"        \
"DP4    oPos.w, mvp[3], iPos;\n"

/**
* Handle to a generic fragment program that passes the input color to output
*
* \note
* Either \c piglit_use_fragment_program or \c piglit_require_fragment_program
* must be called before using this program handle.
*/
extern GLint piglit_ARBfp_pass_through;

static const GLint PIGLIT_ATTRIB_POS = 0;
static const GLint PIGLIT_ATTRIB_TEX = 1;

/**
* Given a GLSL version number, return the lowest-numbered GL version
* that is guaranteed to support it.
*/
unsigned
required_gl_version_from_glsl_version(unsigned glsl_version);

#ifdef __cplusplus
} /* end extern "C" */
#endif

#endif /* __PIGLIT_UTIL_GL_H__ */

Most functions are undocumented although there are a lot of examples of how to use it in other piglit tests. Furthermore, once you know which function you need, it is usually straightforward to learn how to call it.

Just to mention a few: you can request extensions (piglit_require_extension()) or a GL version (piglit_require_gl_version()), compile program (piglit_compile_program()), draw a rectangle or triangle, read pixel values and compare them with a list of expected values, check for GL errors, etc.

/*
* Copyright (c) The Piglit project 2007
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* on the rights to use, copy, modify, merge, publish, distribute, sub
* license, and/or sell copies of the Software, and to permit persons to whom
* the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.  IN NO EVENT SHALL
* VA LINUX SYSTEM, IBM AND/OR THEIR SUPPLIERS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
* USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

#pragma once

/**
* Null parameters are ignored.
*
* \param es Is it GLSL ES?
*/
void piglit_get_glsl_version(bool *es, int* major, int* minor);

GLuint piglit_compile_shader(GLenum target, const char *filename);
GLuint piglit_compile_shader_text_nothrow(GLenum target, const char *text);
GLuint piglit_compile_shader_text(GLenum target, const char *text);
GLint piglit_build_simple_program(const char *vs_source, const char *fs_source);
const char *fs_source);
const char*source1,
va_list ap);
const char *source1,
...);
const char *source1,
...);

extern GLboolean piglit_program_pipeline_check_status(GLuint pipeline);
extern GLboolean piglit_program_pipeline_check_status_quiet(GLuint pipeline);

/**
* Require a specific version of GLSL.
*
* \param version Integer version, for example 130
*/
extern void piglit_require_GLSL_version(int version);
/** Require any version of GLSL */
extern void piglit_require_GLSL(void);
extern void piglit_require_vertex_shader(void);

There are more header files such as piglit-glx-util.h, piglit-matrix.h, piglit-util-egl.h, etc.

Usually, you only need to add piglit-util-gl.h to your source code, however I recommend you to browse through tests/util/ so you find out all the available functions that piglit provides.

# Example

A complete example of how a piglit binary test looks like is ARB_uniform_buffer_object rendering test.

/*
* Copyright (c) 2014 VMware, Inc.
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/

/** @file rendering.c
*
* Test rendering with UBOs.  We draw four squares with different positions,
* sizes, rotations and colors where those parameters come from UBOs.
*/

#include "piglit-util-gl.h"

PIGLIT_GL_TEST_CONFIG_BEGIN

config.supports_gl_compat_version = 20;
config.window_visual = PIGLIT_GL_VISUAL_DOUBLE | PIGLIT_GL_VISUAL_RGBA;

PIGLIT_GL_TEST_CONFIG_END

"#extension GL_ARB_uniform_buffer_object : require\n"
"\n"
"layout(std140) uniform;\n"
"uniform ub_pos_size { vec2 pos; float size; };\n"
"uniform ub_rot {float rotation; };\n"
"\n"
"void main()\n"
"{\n"
"   mat2 m;\n"
"   m[0][0] = m[1][1] = cos(rotation); \n"
"   m[0][1] = sin(rotation); \n"
"   m[1][0] = -m[0][1]; \n"
"   gl_Position.xy = m * gl_Vertex.xy * vec2(size) + pos;\n"
"   gl_Position.zw = vec2(0, 1);\n"
"}\n";

"#extension GL_ARB_uniform_buffer_object : require\n"
"\n"
"layout(std140) uniform;\n"
"uniform ub_color { vec4 color; float color_scale; };\n"
"\n"
"void main()\n"
"{\n"
"   gl_FragColor = color * color_scale;\n"
"}\n";

#define NUM_SQUARES 4
#define NUM_UBOS 3

/* Square positions and sizes */
static const float pos_size[NUM_SQUARES][3] = {
{ -0.5, -0.5, 0.1 },
{  0.5, -0.5, 0.2 },
{ -0.5, 0.5, 0.3 },
{  0.5, 0.5, 0.4 }
};

/* Square color and color_scales */
static const float color[NUM_SQUARES][8] = {
{ 2.0, 0.0, 0.0, 1.0,   0.50, 0.0, 0.0, 0.0 },
{ 0.0, 4.0, 0.0, 1.0,   0.25, 0.0, 0.0, 0.0 },
{ 0.0, 0.0, 5.0, 1.0,   0.20, 0.0, 0.0, 0.0 },
{ 0.2, 0.2, 0.2, 0.2,   5.00, 0.0, 0.0, 0.0 }
};

/* Square rotations */
static const float rotation[NUM_SQUARES] = {
0.0,
0.1,
0.2,
0.3
};

static GLuint prog;
static GLuint buffers[NUM_UBOS];
static GLint alignment;
static bool test_buffer_offset = false;

static void
setup_ubos(void)
{
static const char *names[NUM_UBOS] = {
"ub_pos_size",
"ub_color",
"ub_rot"
};
static GLubyte zeros[1000] = {0};
int i;

glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);
printf("GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT = %d\n", alignment);

if (test_buffer_offset) {
printf("Testing buffer offset %d\n", alignment);
}
else {
/* we use alignment as the offset */
alignment = 0;
}

glGenBuffers(NUM_UBOS, buffers);

for (i = 0; i < NUM_UBOS; i++) {
GLint index, size;

/* query UBO index */
index = glGetUniformBlockIndex(prog, names[i]);

/* query UBO size */
glGetActiveUniformBlockiv(prog, index,
GL_UNIFORM_BLOCK_DATA_SIZE, &size);

printf("UBO %s: index = %d, size = %d\n",
names[i], index, size);

/* Allocate UBO */
/* XXX for some reason, this test doesn't work at all with
* nvidia if we pass NULL instead of zeros here.  The UBO data
* is set/overwritten in the piglit_display() function so this
* really shouldn't matter.
*/
glBindBuffer(GL_UNIFORM_BUFFER, buffers[i]);
glBufferData(GL_UNIFORM_BUFFER, size + alignment,
zeros, GL_DYNAMIC_DRAW);

/* Attach UBO */
glBindBufferRange(GL_UNIFORM_BUFFER, i, buffers[i],
alignment,  /* offset */
size);
glUniformBlockBinding(prog, index, i);

if (!piglit_check_gl_error(GL_NO_ERROR))
piglit_report_result(PIGLIT_FAIL);
}
}

void
piglit_init(int argc, char **argv)
{
piglit_require_extension("GL_ARB_uniform_buffer_object");

if (argc > 1 && strcmp(argv[1], "offset") == 0) {
test_buffer_offset = true;
}

assert(prog);
glUseProgram(prog);

setup_ubos();

glClearColor(0.2, 0.2, 0.2, 0.2);
}

static bool
probe(int x, int y, int color_index)
{
float expected[4];

/* mul color by color_scale */
expected[0] = color[color_index][0] * color[color_index][4];
expected[1] = color[color_index][1] * color[color_index][4];
expected[2] = color[color_index][2] * color[color_index][4];
expected[3] = color[color_index][3] * color[color_index][4];

return piglit_probe_pixel_rgba(x, y, expected);
}

enum piglit_result
piglit_display(void)
{
bool pass = true;
int x0 = piglit_width / 4;
int x1 = piglit_width * 3 / 4;
int y0 = piglit_height / 4;
int y1 = piglit_height * 3 / 4;
int i;

glViewport(0, 0, piglit_width, piglit_height);

glClear(GL_COLOR_BUFFER_BIT);

for (i = 0; i < NUM_SQUARES; i++) {
/* Load UBO data, at offset=alignment */
glBindBuffer(GL_UNIFORM_BUFFER, buffers[0]);
glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(pos_size[0]),
pos_size[i]);
glBindBuffer(GL_UNIFORM_BUFFER, buffers[1]);
glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(color[0]),
color[i]);
glBindBuffer(GL_UNIFORM_BUFFER, buffers[2]);
glBufferSubData(GL_UNIFORM_BUFFER, alignment, sizeof(rotation[0]),
&rotation[i]);

if (!piglit_check_gl_error(GL_NO_ERROR))
return PIGLIT_FAIL;

piglit_draw_rect(-1, -1, 2, 2);
}

pass = probe(x0, y0, 0) && pass;
pass = probe(x1, y0, 1) && pass;
pass = probe(x0, y1, 2) && pass;
pass = probe(x1, y1, 3) && pass;

piglit_present_results();

return pass ? PIGLIT_PASS : PIGLIT_FAIL;
}

The source starts with the license header and describing what the test does. Following those, it’s the config setup for piglit: request doube buffer, RGBA pixel format and GL compat version 2.0.

Furthermore, this test defines GLSL shaders, the global data and helper functions as any other OpenGL C-language program. Notice that setup_ubos() include calls to OpenGL API but also calls to piglit_check_gl_error() and piglit_report_result() which are used to check for any OpenGL error and tell piglit that there was a failure, respectively.

Following the structure introduced before, piglit_init() indicates that ARB_uniform_buffer_object extension is required, it builds the program with the aforementioned shaders, setups the uniform buffer objects and sets the clear color.

Finally in piglit_display() is where the relevant content is placed. Among other things, it loads UBO’s data, draw a rectangle and check that the rendered pixels have the same values as the expected ones. Depending of the result of that check, it will report to piglit that the test was a success or a failure.

# How to add a binary test to piglit

## How to build it

Now that you have written the source code of your test, it’s time to learn how to build it. As we have seen in an earlier post, piglit uses cmake for generating binaries.

First you need to check if cmake tracks the directory where your source code is. If you create a new extension subdirectory under tests/spec, modify this CMakeLists.txt file and add yours.

Once you have done that, cmake needs to know that there are binaries to build and where to get its source code. For doing that, create a CMakeLists.txt file in your test’s directory with the following content:

piglit_include_target_api()

If you have binaries inside directories pending the former, you can add them to the same CMakeLists.txt file you are creating:

add_subdirectory (compiler)
piglit_include_target_api()

But this is not enough to compile your test as we have not specified which binary is and where to get its source code. We do that by creating a new file CMakeLists.gl.txt with a similar content than this one.

include_directories(
${GLEXT_INCLUDE_DIR}${OPENGL_INCLUDE_PATH}
)

piglitutil_${piglit_target_api}${OPENGL_gl_LIBRARY}
)

# vim: ft=cmake:

As you see, first we declare where to find the needed headers and libraries. Then we define the binary name (arb_shader_storage_buffer_object-minmax) and which is its source code file (minmax.c).

And that’s it. If everything is fine, next time you run cmake . && make in piglit’s directory, piglit will build the test and place it inside bin/ directory.

### Example in piglit repository

Let’s review a real example of how a test was added for a new extension in piglit (this commit). As you see, it added tests/spec/arb_robustness/ subdirectory to tests/spec/CMakeLists.txt, create a tests/spec/arb_robustness/CMakeLists.txt to indicate cmake to track this directory, tests/spec/arb_robustness/CMakeLists.gl.txt to compile the binary test file and add its source code file tests/spec/arb_robustness/draw-vbo-bounds.c.

tests/spec/CMakeLists.txt | 1 +
tests/spec/arb_robustness/CMakeLists.gl.txt | 19 +++++++++++++++++++
tests/spec/arb_robustness/CMakeLists.txt | 1 +
tests/spec/arb_robustness/draw-vbo-bounds.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 226 insertions(+)

If you run git log tests/spec/<dir> command in other directories, you will find similar commits.

## How to run it with all.py profile

Once you successfully build the binary test program you can run it standalone. However, it’s better to add it to tests/all.py profile so it is executed automatically by this profile.

Open tests/all.py  and look for your extension name, then add your test by looking at  how other tests are defined and adapt it to your case. In case you are developing tests for a new extension, you have to add more lines to tests/all.py. For example this is how was done for ARB_shader_storage_buffer_object extension:

with profile.group_manager(
PiglitGLTest,
g(['arb_shader_storage_buffer_object-minmax'], 'minmax')

These lines make several things: indicate under which extension you will find the results, which are the binaries to run and which short name will appear in the summaries for each binary.

# Conclusions

This post and last one were about how to create your own piglit tests. I hope you will find them useful and I will be glad to see new contributors because of them

 June 05, 2015

I've recently found myself explaining polkit (formerly PolicyKit) to one of Collabora's clients, and thought that blogging about the same topic might be useful for other people who are confused by it; so, here is why udisks2 and polkit are the way they are.

As always, opinions in this blog are my own, not Collabora's.

## Privileged actions

Broadly, there are two ways a process can do something: it can do it directly (i.e. ask the kernel directly), or it can use inter-process communication to ask a service to do that operation on its behalf. If it does it directly, the components that say whether it can succeed are the Linux kernel's normal permissions checks (DAC), and if configured, AppArmor, SELinux or a similar MAC layer. All very simple so far.

Unfortunately, the kernel's relatively coarse-grained checks are not sufficient to express the sorts of policies that exist on a desktop/laptop/mobile system. My favourite example for this sort of thing is mounting filesystems. If I plug in a USB stick with a FAT filesystem, it's reasonable to expect my chosen user interface to either mount it automatically, or let me press a button to mount it. Similarly, to avoid data loss, I should be able to unmount it when I'm finished with it. However, mounting and unmounting a USB stick is fundamentally the same system call as mounting and unmounting any other filesystem - and if ordinary users can do arbitrary mount system calls, they can cause all sorts of chaos, for instance by mounting a filesystem that contains setuid executables (privilege escalation), or umounting a critical OS filesystem like /usr (denial of service). Something needs to arbitrate: “you can mount filesystems, but only under certain conditions”.

The kernel developer motto for this sort of thing is “mechanism, not policy”: they are very keen to avoid encoding particular environments' policies (the sort of thing you could think of as “business rules”) in the kernel, because that makes it non-generic and hard to maintain. As a result, direct mount/unmount actions are only allowed by privileged processes, and it's up to user-space processes to arrange for a privileged process to make the desired mount syscall.

Here are some other privileged actions which laptop/desktop users can reasonably expect to “just work”, with or without requiring a sysadmin-like (root-equivalent) user:

• reconfiguring networking (privileged because, in the general case, it's an availability and potentially integrity issue)
• installing, upgrading or removing packages (privileged because, in the general case, it can result in arbitrary root code execution)
• suspending or shutting down the system (privileged because you wouldn't want random people doing this on your server, but should normally be allowed on e.g. laptops for people with physical access, because they could just disconnect the power anyway)

In environments that use a MAC framework like AppArmor, actions that would normally be allowed can become privileged: for instance, in a framework for sandboxed applications, most apps shouldn't be allowed to record audio. This prevents carrying out these actions directly, again resulting in the only way to achieve them being to ask a service to carry out the action.

## Ask a system service to do it

On to the next design, then: I can submit a request to a privileged process, which does some checks to make sure I'm not trying to break the system (or alternatively, that I have enough sysadmin rights that I'm allowed to break the system if I want to), and then does the privileged action for me.

You might think I'm about to start discussing D-Bus and daemons, but actually, a prominent earlier implementation of this was mount(8), which is normally setuid root:

% ls -l /bin/mount
-rwsr-xr-x 1 root root 40000 May 22 11:37 /bin/mount


If you look at it from an odd angle, this is inter-process communication across a privilege boundary: I run the setuid executable, creating a process. Because the executable has the setuid bit set, the kernel makes the process highly privileged: its effective uid is root, and it has all the necessary capabilities to mount filesystems. I submit the request by passing it in the command-line arguments. mount does some checks - specifically, it looks in /etc/fstab to see whether the filesystem I'm trying to mount has the “user” or “users” flag - then carries out the mount system call.

There are a few obvious problems with this:

• When machines had a static set of hardware devices (and a sysadmin who knew how to configure them), it might have made sense to list them all in /etc/fstab; but this is not a useful solution if you can plug in any number of USB drives, or if you are a non-expert user with Linux on your laptop. The decision ought to be based on general attributes of devices, such as “is removable?”, and on the role of the machine.
• Setuid executables are alarmingly easy to get wrong so it is not necessarily wise to assume that mount(8) is safe to be setuid.
• One fact that a reasonable security policy might include is “users who are logged in remotely should have less control over physically present devices than those who are physically present” - but that sort of thing can't be checked by mount(8) without specifically teaching the mount binary about it.

### Ask a system service to do it, via D-Bus or other IPC

To avoid the issues of setuid, we could use inter-process communication in the traditional sense: run a privileged daemon (on boot or on-demand), make it listen for requests, and use the IPC channel as our privilege boundary.

udisks2 is one such privileged daemon, which uses D-Bus as its IPC channel. D-Bus is a commonly-used inter-process system; one of its intended/designed uses is to let user processes and system services communicate, especially this sort of communication between a privileged daemon and its less-privileged clients.

People sometimes criticize D-Bus as not doing anything you couldn't do yourself with some AF_UNIX sockets. Well, no, of course it doesn't - the important bit of the reference implementation and the various interoperable reimplementations consists of a daemon and some AF_UNIX sockets, and the rest is a simple matter of programming. However, it's sufficient for most uses in its problem space, and is usually better than inventing your own.

The advantage of D-Bus over doing your own thing is precisely that you are not doing your own thing: good IPC design is hard, and D-Bus makes some structural decisions so that fewer application authors have to think about them. For instance, it has a central “hub” daemon (the dbus-daemon, or “message bus”) so that n communicating applications don't need O(n²) sockets; it uses the dbus-daemon to provide a total message ordering so you don't have to think about message reordering; it has a distributed naming model (which can also be used as a distributed mutex) so you don't have to design that; it has a serialization format and a type system so you don't have to design one of those; it has a framework for “activating" run-on-demand daemons so they don't have to use resources initially, implemented using a setuid helper and/or systemd; and so on.

If you have religious objections to D-Bus, you can mentally replace “D-Bus” with “AF_UNIX or something” and most of this article will still be true.

## Is this OK?

In either case - exec'ing a privileged helper, or submitting a request to a privileged daemon via IPC - the privileged process has two questions that it needs to answer before it does its work:

• what am I being asked to do?
• should I do it?

It needs to make some sort of decision on the latter based on the information available to it. However, before we even get there, there is another layer:

• did the request get there at all?

In the setuid model, there is a simple security check that you can apply: you can make /bin/mount only executable by a particular group, or only executable by certain AppArmor profiles, or similar. That works up to a point, but cannot distinguish between physically-present and not-physically-present users, or other facts that might be interesting to your local security policy. Similarly, in the IPC model, you can make certain communication channels impossible, for instance by using dbus-daemon's ability to decide which messages to deliver, or AF_UNIX sockets' filesystem permissions, or a MAC framework like AppArmor.

Both of these are quite “coarse-grained” checks which don't really understand the finer details of what is going on. If the answer to "is this safe?” is something of the form “maybe, it depends on...”, then they can't do the right thing: they must either let it through and let the domain-specific privileged process do the check, or deny it and lose potentially useful functionality.

For instance, in an AppArmor environment, some applications have absolutely no legitimate reason to talk to udisks2, so the AppArmor policy can just block it altogether. However, once again, this is a coarse-grained check: the kernel has mechanism, not policy, and it doesn't know what the service does or why. If the application does need to be able to talk to the service at all, then finer-grained access control (obeying some, but not all, requests) has to be the service's job.

dbus-daemon does have the ability to match messages in a relatively fine-grained way, based on the object path, interface and member in the message, as well as the routing information that it uses itself (i.e. the source and destination). However, it is not clear that this makes a great deal of sense conceptually: these are facts about the mechanics of the IPC, not facts about the domain-specific request (because the mechanics of the IPC are all that dbus-daemon understands). For instance, taking the udisks2 example again, dbus-daemon can't distinguish between an attempt to adjust mount options for a USB stick (probably fine) and an attempt to adjust mount options for /usr (not good).

To have a domain-specific security policy, we need a domain-specific component, for instance udisks2, to get involved. Unlike dbus-daemon, udisks2 knows that not all disks are equal, knows which categories make sense to distinguish, and can identify which categories a particular disk is in. So udisks2 can make a more informed decision.

So, a naive approach might be to write a function in udisks2 that looks something like this pseudocode:

may_i_mount_this_disk (user, disk, mount options) → boolean
{
if (user is root || user is root-equivalent)
return true;

if (disk is not removable)
return false;

if (mount options are scary)
return false;

if (user is in “manipulate non-local disks” group)
return true;

if (user is not logged-in locally)
return false;

# https://en.wikipedia.org/wiki/Multiseat_configuration
if (user is not logged-in on the same seat where the disk is
plugged in)
return false;

return true;
}

## Delegating the security policy to something central

The pseudocode security policy outlined above is reasonably complicated already, and doesn't necessarily cover everything that you might want to consider.

Meanwhile, not every system is the same. A general-purpose Linux distribution like Debian might run on server/mainframe systems with only remote users, personal laptops/desktops with one root-equivalent user, locked-down corporate laptops/desktops, mobile devices and so on; these systems should not necessarily all have the same security policy.

Another interesting factor is that for some privileged operations, you might want to carry out interactive authorization: ask the requesting user to confirm that the action (which might have come from a background process) should take place (like Windows' UAC), or to prove that the person currently at the keyboard is the same as the person who logged in by giving their password (like sudo).

We could in principle write code for all of this in udisks2, and in NetworkManager, and in systemd, ... - but that clearly doesn't scale, particularly if you want the security policy to be configurable. Enter polkit (formerly PolicyKit), a system service for applying security policies to actions.

The way polkit works is that the application does its domain-specific analysis of the request - in the case of udisks2, whether the device to be mounted is removable, whether the mount options are reasonable, etc. - and converts it into an action. The action gives polkit a way to distinguish between things that are conceptually different, without needing to know the specifics. For instance, udisks2 currently divides up filesystem-mounting into org.freedesktop.udisks2.filesystem-mount, org.freedesktop.udisks2.filesystem-mount-fstab, org.freedesktop.udisks2.filesystem-mount-system and org.freedesktop.udisks2.filesystem-mount-other-seat.

The application also finds the identity of the user making the request. Next, the application sends the action, the identity of the requesting user, and any other interesting facts to polkit. As currently implemented, polkit is a D-Bus service, so this is an IPC request via D-Bus. polkit consults its database of policies in order to choose one of several results:

• yes, allow it
• no, do not allow it
• ask the user to either authenticate as themselves or as a privileged (sysadmin) user to allow it, or cancel authentication to not allow it
• ask the user to authenticate the first time, but if they do, remember that for a while and don't ask again

So how does polkit decide this? The first thing is that it reads the machine-readable description of the actions, in /usr/share/polkit-1/actions, which specifies a default policy. Next, it evaluates a local security policy to see what that says. In the current version of polkit, the local security policy is configured by writing JavaScript in /etc/polkit-1/rules.d (local policy) and /usr/share/polkit-1/rules.d (OS-vendor defaults). In older versions such as the one currently shipped in Debian unstable, there was a plugin architecture; but in practice nobody wrote plugins for it, and instead everyone used the example local authority shipped with polkit, which was configured via files in /etc/polkit-1/localauthority and /etc/polkit-1/localauthority.d.

These policies can take into account useful facts like:

• what is the action we're talking about?
• is the user logged-in locally? are they active, i.e. they are not just on a non-current virtual console?
• is the user in particular groups?

For instance, gnome-control-center on Debian installs this snippet:

polkit.addRule(function(action, subject) {
if ((action.id == "org.freedesktop.locale1.set-locale" ||
action.id == "org.freedesktop.locale1.set-keyboard" ||
action.id == "org.freedesktop.hostname1.set-static-hostname" ||
action.id == "org.freedesktop.hostname1.set-hostname" ||
action.id == "org.gnome.controlcenter.datetime.configure") &&
subject.local &&
subject.active &&
subject.isInGroup ("sudo")) {
return polkit.Result.YES;
}
});


which is reasonably close to being pseudocode for “active local users in the sudo group may set the system locale, keyboard layout, hostname and time, without needing to authenticate”. A system administrator could of course override that by dropping a higher-priority policy for some or all of these actions into /etc/polkit-1/rules.d.

## Summary

• Kernel-based permission checks are not sufficiently fine-grained to be able to express some quite reasonable security policies
• Fine-grained access control needs domain-specific understanding
• The kernel doesn't have that information (and neither does dbus-daemon)
• The privileged service that does the domain-specific thing can provide the domain-specific understanding to turn the request into an action
• polkit evaluates a configurable policy to determine whether privileged services should carry out requested actions
 June 04, 2015

libinput provides a number of different out-of-the-box configurations, based on capabilities. For example: middle mouse button emulation is enabled by default if a device only has left and right buttons. On devices with a physical middle button it is available but disabled by default. Likewise, whether tapping is enabled and/or available depends on hardware capabilities. But some requirements cannot be gathered purely by looking at the hardware capabilities.

libinput uses a couple of udev properties, assigned through udev's hwdb, to detect device types. We use the same mechanism to provide us with specific tags to adjust libinput-internal behaviour. The udev properties named LIBINPUT_MODEL_.... tag devices based on a set of udev rules combined with hwdb matches. For example, we tag Chromebooks with LIBINPUT_MODEL_CHROMEBOOK.

Inside libinput, we parse those tags and use them for model-specific configuration. At the time of writing this, we use the chromebook tag to automatically enable clickfinger behaviour on those touchpads (which matches the google defaults on chromebooks). We tag the Lenovo X230 touchpad to give it it's own acceleration method. This touchpad is buggy and the data it sends has a very bad resolution.

In the future these tags will likely expand and encompass more devices that need customised tweaks. But the goal is always that libinput works well out of the box, even if the hardware is quirky. Introducing these tags instead of a sleigh of configuration options has short-term drawbacks: it increases the workload on us maintainers and it may require software updates to get a device to work exactly the way it should. The long-term benefits are maintainability and testability though, as well as us being more aware of what hardware is out there and how it needs to be fixed. Plus the relief of not having to deal with configuration snippets that are years out of date, do all the wrong things but still spread across forums like an STD.

Note: the tags are considered private API and may change at any time, depending what we want or need to do with them. Do not use them for home-made configuration.

 June 03, 2015

libinput uses udev tags to determine what a device is. This is a significant difference to the X.Org stack which determines how to deal with a device based on an elaborate set of rules, rules grown over time, matured, but with a slight layer of mould on top by now. In evdev's case that is understandable, it stems from a design where you could just point it at a device in your xorg.conf and it'd automagically work, well before we had even input hotplugging in X. What it leads to now though is that the server uses slightly different rules to decide what a device is (to implement MatchIsTouchscreen for example) than evdev does. So you may have, in theory, a device that responds to MatchIsTouchscreen only to set itself up as keyboard.

libinput does away with this in two ways: it punts most of the decisions on what a device is to udev and its ID_INPUT_... properties. A device marked as ID_INPUT_KEYBOARD will initialize a keyboard interface, an ID_INPUT_TOUCHPAD device will initialize a touchpad backend. The obvious advantage of this is that we only have one place where we have generic device type rules. The second advantage is that where this one place isn't precise enough, it's simple to override with custom rule sets. For example, Wacom tablets are hard to categorise just by looking at the device alone. libwacom generates a udev rule containing the VID/PID of each known device with the right ID_INPUT_TABLET etc. properties.

This is a libinput-internal behaviour. Externally, we are a lot more vague. In fact, we don't tell you at all what a device is, other than what events it will send (pointer, keyboard, or touch). We have thought about implementing some sort of device identifier and the conclusion is that we won't implement this as part of libinput's API because it will simply be wrong some of the time. And if it's wrong, it requires the caller to re-implement something on top of it. At which point the caller may as well implement all of it instead. Why do we expect it to be wrong? Because libinput doesn't know the exact context that requires a device to be labelled as a specific type.

Take a keyboard for example. There are a great many devices that send key events. To the client a keyboard may be any device that can get an XKB layout and is used for typing. But to the compositor, a keyboard may be anything that can send a few specific keycodes. A device with nothing but KEY_POWER? That's enough for the compositor to decide to shut down but that device may not otherwise work as a keyboard. libinput can't know this context. But what libinput provides is the API to query information. libinput_device_pointer_has_button() and libinput_device_keyboard_has_key() are the two candidates here to query about a specific set of buttons and/or keys.

Touchpads, trackpoints and mice all look send pointer events and there is no flag that tells you the device type and that is intentional. libinput doesn't have any intrinsic knowledge about what is a touchpad, we take the ID_INPUT_TOUCHPAD tag. At best, we refuse some devices that were clearly mislabelled but we don't init devices as touchpads that aren't labelled as such. Any device type identification would likely be wrong - for example some Wacom tablets are touchpads internally but would be considered tablets in other contexts.

So in summary, callers are encouraged to rely on the udev information and other bits they can pull from the device to group it into the semantically correct device type. libinput_device_get_udev_device() provides a udev handle for a libinput device and all configurable features are query-able (e.g. "does this device support tapping?"). libinput will not provide a device type because it would likely be wrong in the current context anyway.

 June 02, 2015

TLDR: as of libinput 0.16 you can end a touchpad tap-and-drag with a final additional tap

libinput originally only supported single-tap and double-tap. With version 0.15 we now support multi-tap, so you can tap repeatedly to get a triple, quadruple, etc. click. This is quite useful in text editors where a triple click highlights a line, four clicks highlight a paragraph, and 28 clicks order a new touchpad from ebay. Multi-tap also works with drag-and drop, so a triple tap followed by a finger down and hold will send three clicks followed by a single click.

We also support continuous tap-and-drag which is something synaptics didn't support provided with the LockedDrags option: Once the user is in dragging mode (x * tap + hold finger down) they can lift the finger and set it down again without the drag being interrupted. This is quite useful when you have to move across the screen, especially on smaller touchpads or for users that prefer a slow acceleration.

Of course, this adds a timeout to the drag release since we need to wait and see whether the finger comes down again. To help accelerate this, we added a tap-to-release feature (contributed by Velimir Lisec): once in drag mode a final tap will release the button immediately. This is something that OS X has supported for years and after a bit of muscle memory retraining it becomes second nature quickly. So the new timeout-free way to tap-and-drag on a touchpad is now:

   tap, finger-down, move, .... move, finger up, tap

Update 03/06/25: add synaptics LockedDrag option reference

I don't often write about tools I use when for my daily software development tasks. I recently realized that I really should start to share more often my workflows and weapons of choice.

One thing that I have a hard time enduring while doing Python code reviews, is people writing utility code that is not directly tied to the core of their business. This looks to me as wasted time maintaining code that should be reused from elsewhere.

So today I'd like to start with retrying, a Python package that you can use to… retry anything.

## It's OK to fail

Often in computing, you have to deal with external resources. That means accessing resources you don't control. Resources that can fail, become flapping, unreachable or unavailable.

Most applications don't deal with that at all, and explode in flight, leaving a skeptical user in front of the computer. A lot of software engineers refuse to deal with failure, and don't bother handling this kind of scenario in their code.

In the best case, applications usually handle simply the case where the external reached system is out of order. They log something, and inform the user that it should try again later.

In this cloud computing area, we tend to design software components with service-oriented architecture in mind. That means having a lot of different services talking to each others over the network. And we all know that networks tend to fail, and distributed systems too. Writing software with failing being part of normal operation is a terrific idea.

## Retrying

In order to help applications with the handling of these potential failures, you need a plan. Leaving to the user the burden to "try again later" is rarely a good choice. Therefore, most of the time you want your application to retry.

Retrying an action is a full strategy on its own, with a lot of options. You can retry only on certain condition, and with the number of tries based on time (e.g. every second), based on a number of tentative (e.g. retry 3 times and abort), based on the problem encountered, or even on all of those.

For all of that, I use the retrying library that you can retrieve easily on PyPI.

retrying provides a decorator called retry that you can use on top of any function or method in Python to make it retry in case of failure. By default, retry calls your function endlessly until it returns rather than raising an error.

import randomfrom retrying import retry @retrydef pick_one():    if random.randint(0, 10) != 1:        raise Exception("1 was not picked")

This will execute the function pick_one until 1 is returned by random.randint.

retry accepts a few arguments, such as the minimum and maximum delays to use, which also can be randomized. Randomizing delay is a good strategy to avoid detectable pattern or congestion. But more over, it supports exponential delay, which can be used to implement exponential backoff, a good solution for retrying tasks while really avoiding congestion. It's especially handy for background tasks.

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)def wait_exponential_1000():    print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"    raise Exception("Retry!")

You can mix that with a maximum delay, which can give you a good strategy to retry for a while, and then fail anyway:

# Stop retrying after 30 seconds anyway>>> @retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_delay=30000)... def wait_exponential_1000():...     print "Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards"...     raise Exception("Retry!")...>>> wait_exponential_1000()Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsWait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsWait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsWait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsWait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsWait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwardsTraceback (most recent call last):  File "<stdin>", line 1, in <module>  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f    return Retrying(*dargs, **dkw).call(f, *args, **kw)  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 212, in call    raise attempt.get()  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 247, in get    six.reraise(self.value[0], self.value[1], self.value[2])  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)  File "<stdin>", line 4, in wait_exponential_1000  Exception: Retry!

A pattern I use very often, is the ability to retry only based on some exception type. You can specify a function to filter out exception you want to ignore or the one you want to use to retry.

def retry_on_ioerror(exc):    return isinstance(exc, IOError) @retry(retry_on_exception=retry_on_ioerror)def read_file():    with open("myfile", "r") as f:        return f.read()

retry will call the function passed as retry_on_exception with the exception raised as first argument. It's up to the function to then return a boolean indicating if a retry should be performed or not. In the example above, this will only retry to read the file if an IOError occurs; if any other exception type is raised, no retry will be performed.

The same pattern can be implemented using the keyword argument retry_on_result, where you can provide a function that analyses the result and retry based on it.

def retry_if_file_empty(result):    return len(result) <= 0 @retry(retry_on_result=retry_if_file_empty)def read_file():    with open("myfile", "r") as f:        return f.read()

This example will read the file until it stops being empty. If the file does not exist, an IOError is raised, and the default behavior which triggers retry on all exceptions kicks-in – the retry is therefore performed.

That's it! retry is really a good and small library that you should leverage rather than implementing your own half-baked solution!

 May 29, 2015

In earlier posts I talked about how to install piglit on your system and run the first time and how to tailor your piglit run. I have given a talk in FOSDEM 2015 about how to test your OpenGL drivers using Free Software where I explained how to run piglit and dEQP. This post and the next one are going to introduce the topic of creating new tests for piglit, as this was not covered before in this post series.

# Brief introduction

Before start talking about how to create a GLSL test, let me explain you a couple of things about how piglit organizes its tests.

First is that all the tests are defined inside tests/ directory.

• The ones related to a specific spec (like OpenGL extension, GL version, GLSL version, etc) are placed inside tests/spec subdirectory.
• Shader tests are usually defined inside  tests/shaders/ directory, but not always. If they are specific to a OpenGL extension, they will be in the its subdirectory (see first bullet).
• Tests specific to other APIs: tests/egl, tests/cl, etc.
• Etc.

Take your time browsing all these directories and figure out the best place for your tests or, if you are looking for a specific test, where it is likely placed.

Second thing is that binary tests should be defined inside tests/all.py to be executed in a full piglit run. This is not the case for GLSL tests or shader runner tests as we are going to see in this post.

The most common language to write these tests is C but there are a couple of alternatives if you are interested on writing GLSL shader tests. Let’s start with a couple of examples of GLSL shader tests as it is the easiest way to contribute new tests to piglit.

# GLSL compiler test

When creating a GLSL compiler test, most of the time you can avoid to write a C code example with all those bells and whistles. Actually, it is pretty straightforward if you just want to check that a GLSL shader compiles successfully or not according to your expectations.

Usually, the rule of thumb is to keep GLSL tests simple and focused to detect failures (or success) in shaders’ compilation. For example, you want to check that defining a shader storage buffer block is only possible if ARB_shader_storage_buffer_object extension is enabled in the driver.

#version 120

buffer ssbo {
vec4 a;
};

void foo(void) {
}

Once you write the GLSL shader, save it to a file named like extension-disabled-shader-storage-block.frag. There are several suffixes that you can use depending of the GLSL shader type: .vert for vertex shader, .geom for geometry shader, .frag for fragment shader. Notice that the name itself is a summary of what the test does, so other people can find what it does without opening the file.

There is a piglit binary called glslparser that it is going to pick this shader and compile it checking for errors. This binary needs some parameters to know how to run this test and the expected result, so we provide them. Add at the top of the shader the following content:

// [config]
// expect_result: fail
// glsl_version: 1.20
// [end config]

There we write down that we expect the test to fail, which is the minimum supported GLSL version to run this test and the required extensions. At this case we need GLSL 1.20 version and GL_ARB_shader_storage_buffer_object extension.

// [config]
// expect_result: fail
// glsl_version: 1.20
// [end config]

#version 120

buffer ssbo {
vec4 a;
};

void foo(void) {
}

Once you have it, you should save it in the proper place. The directory will be the same than the extension name it is going to test, usually it is saved in a subdirectory called compiler because we are testing the GLSL shader compilation, for this example: tests/spec/arb_shader_storage_buffer_object/compiler.

And that’s it. Next time you run piglit with tests/all.py profile, one script will find this test, parse it and execute glslparser with the provided configuration checking if the result of your shader is the expected one or not.

You can execute it manually by running the following command:

$bin/glslparser tests/spec/arb_shader_storage_buffer_object/compiler/extension-disabled-shader-storage-block.frag fail 1.20 GL_ARB_shader_storage_buffer_object As you see, the last arguments come from the config we defined in the test file but we added them by hand. It is possible to link a GLSL shader and look for errors by using check_link with true or false, depending if you expect the test to succeed or to fail at link time: // check_link: true Nevertheless, it only links one shader. If you want to link more than one shader, I recommend you to use another type of piglit tests (see next section). As you see, you can easily write new GLSL shader tests to verify some bits of a specification: valid/invalid keywords, initialization, special cases explained in the spec, etc. # GLSL shader linker test Sometimes compiling/linking only one shader is not enough, there are cases that you want to compile and link different but related shaders: link a vertex and a fragment shader or a geometry shader between them, etc. In the previous example, we saw that the GLSL shader test does not link different shaders (actually there is only one). When there are several GLSL shaders, you need to use another type of piglit tests: shader_runner tests. As usual, you first start writing the GLSL shaders. In some cases, you can substitute one shader by its pass-through counterpart to avoid writing “useless” shaders when there is not any user provided interface dependency. Let’s start with an example of a pass-through vertex shader: # ARB_gpu_shader5 spec says: # "If an implementation supports vertex streams, the individual # streams are numbered 0 through -1" # # This test verifies that a link error occurs if EmitStreamVertex() # is called with a stream value which is negative. [require] GLSL >= 1.50 GL_ARB_gpu_shader5 [vertex shader passthrough] [geometry shader] #extension GL_ARB_gpu_shader5 : enable layout(points) in; layout(points, max_vertices=3) out; void main() { gl_Position = vec4(1.0, 1.0, 1.0, 1.0); EmitStreamVertex(-1); EndStreamPrimitive(-1); } [fragment shader] out vec4 color; void main() { color = vec4(0.0, 1.0, 0.0, 1.0); } [test] link error The file starts with a comment describing what the test does and which spec (or part of it) is checking. Next step is very similar to the previous example: specify the minimum GL and/or GLSL version required and the extensions needed for the test execution. This information is written in the [require] section. Then the different GLSL shader definitions start: vertex shader is pass-through as it has no user-defined interface dependency to next stage (geometry shader at this case). After it, it defines the geometry and fragment shaders and finally the test commands to run. In this case, we expect the test to fail at link time. However you are not limited to link checking, there are other available commands to run in the test commands part: • Load uniform values • uniform <type> <name> <value> uniform vec4 color 0.0 0.5 0.0 0.0 • Draw rectangles • draw rect <coordinates> draw rect -1 -1 2 2 • Probe a pixel color value and compare it to an expected value • probe {rgb, rgba} probe rgb 1 1 0.0 1.0 0.0 • Probe all pixel values. • probe all {rgb,rgba} probe all rgba 0.0 1.0 0.0 0.0 Or even more complex commands: • Load data into a vertex array object and render primitives using that vertex array data. [vertex data] vertex/float/3 1.0 1.0 1.0 -1.0 1.0 1.0 -1.0 -1.0 1.0 1.0 -1.0 1.0 [test] draw arrays GL_TRIANGLE_FAN 0 4 • Relative probe pixel color relative probe rgb (.5, .5) (0.0, 1.0, 0.0) • And much more: [fragment program] !!ARBfp1.0 OPTION ARB_fragment_program_shadow; TXP result.color, fragment.texcoord[0], texture[0], SHADOWRECT; END [test] texture shadowRect 0 (32, 32) texparameter Rect depth_mode luminance texparameter Rect compare_func greater Just check other tests to figure out if they are doing something like what you want to do in your test and copy the interesting bits! Save the file with a good name describing briefly what it does (example’s filename is stream-negative-value.shader_test) and place it in the corresponding subdirectory. For this example, the proper place is under GL_ARB_gpu_shader5 test directory and linker subdirectory because this is a linker test: tests/spec/arb_gpu_shader5/linker/ In order to execute this test with tests/all.py profile, you don’t need to do anything else. A script will find this test file and call shader_runner binary with the filename as a parameter. In case you want to run it by yourself, the command is easy: $ bin/shader_runner tests/spec/arb_gpu_shader5/linker/stream-negative-value.shader_test

Next post will cover the creation of GL binary tests within the piglit framework. I hope this post and next one will help you to contribute to piglit!

 May 26, 2015

So we just got the second Fedora Workstation release out the door, and I am quite happy with it, we had quite a few last minute hardware issues pop up, but due to the hard work of the team we where able to get them fixed in time for todays release.

Every release we do is of course the result of both work we do as part of the Fedora Workstation team, but we also rely on a lot of other people upstream. I would like to especially call out Laurent Pinchart, who is the upstream maintainer of the UVC driver, who fixed a bug we discovered with some built in webcams on newer laptops. So thank you Laurent! So for any users of the Toshiba z20t Portege laptop, your rear camera now works thanks to Laurent

Having a relatively short development cycle this release doesn’t contain huge amounts of major changes, but our team did manage to sneak in a few nice new features. As you can see from this blog entry from Allan Day the notification area re-design that he and Florian worked on landed. It is a huge improvement in my opinion and will let us continue polishing the notification behavior of applications going forward.

We have a bunch of improvements to the Nautilus file manager thanks to the work of Carlos Soriano. Recommend reading through his blog as there is a quite sizeable collection of smaller fixes and improvements he was able to push through.

Another thing we got properly resolved for Fedora Workstation 22 is installing it in Boxes. Boxes is our easy to use virtual machine manager which we are putting resources into to make a great developer companion. So while this is a smaller fix for Boxes and Fedora, we have some great Boxes features lining up for the next Fedora release, so stayed tuned for more on that in another blog post.

Wayland support is also marching forward with this release. The GDM session you get upon installing Fedora Workstation 22 will now default to Wayland, but fall back to X if there is an issue. It is a first step towards migrating the default session to Wayland. We still have some work to do there to get the Wayland session perfect, but we are closing the gap rapidly. Jonas Ådahl and Owen Taylor is pushing that effort forward.

Related to Wayland we introduce libinput as the backend for both X and Wayland in this release. While we shipped libinput in Fedora 21, when we wrote libinput we did so with Wayland as the primary target, yet at the same time we realized that we didn’t want to maintain two separate input systems going forward, so in this release also X.org uses libinput for input. This means we have one library to work on now that will improve input in both your Wayland session and X sessions.

This is also the first release featuring the new Adwaita theme for Qt. This release supports Qt4, but we hope to support Qt5 in an upcoming Fedora release and also include a dark variant of the theme for Qt applications. Martin Briza has been leading that effort.

Anyway, we have a lot of features in the pipeline now for Fedora Workstation 23 since quite a few of the items planned for Fedora Workstation 22 didn’t get completed in time, so I am looking forward to writing a blog informing you about those soon.

Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months.

I've attended the summit mainly to discuss and follow-up new developments on Ceilometer, Gnocchi and Oslo. It has been a pretty good week and we were able to discuss and plan a few interesting things.

## Ops feedback

We had half a dozen Ceilometer sessions, and the first one was dedicated to getting feedbacks from operators using Ceilometer. We had a few operators present, and a few of the Ceilometer team. We had constructive discussion, and my feeling is that operators struggles with 2 things so far: scaling Ceilometer storage and having Ceilometer not killing the rest of OpenStack.

We discussed the first point as being addressed by Gnocchi, and I presented a bit Gnocchi itself, as well as how and why it will fix the storage scalability issue operators encountered so far.

Ceilometer putting down the OpenStack installation is more interesting problem. Ceilometer pollsters request information from Nova, Glance… to gather statistics. Until Kilo, Ceilometer used to do that regularly and at fixed interval, causing high pike load in OpenStack. With the introduction of jitter in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than blaming the components being guilty of poor designs. We'd like to push forward improving these components, but it's probably going to take a long time.

## Componentisation

When I started the Gnocchi project last year, I pretty soon realized that we would be able to split Ceilometer itself in different smaller components that could work independently, while being able to leverage each others. For example, Gnocchi can run standalone and store your metrics even if you don't use Ceilometer – nor even OpenStack itself.

My fellow developer Chris Dent had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split in different parts that people could assemble together or run on their owns.

Interestingly enough, we had three 40 minutes sessions planned to talk and debate about this division of Ceilometer, though we all agreed in 5 minutes that this was the good thing to do. Five more minutes later, we agreed on which part to split. The rest of the time was allocated to discuss various details of that split, and I engaged to start doing the work with Ceilometer alarming subsystem.

I wrote a specification on the plane bringing me to Vancouver, that should be approved pretty soon now. I already started doing the implementation work. So fingers crossed, Ceilometer should have a new components in Liberty handling alarming on its own.

This would allow users for example to only deploys Gnocchi and Ceilometer alarm. They would be able to feed data to Gnocchi using their own system, and build alarms using Ceilometer alarm subsystem relying on Gnocchi's data.

## Gnocchi

We didn't have a Gnocchi dedicated slot – mainly because I indicated I didn't feel we needed one. We anyway discussed a few points around coffee, and I've been able to draw a few new ideas and changes I'd like to see in Gnocchi. Mainly changing the API contract to be more asynchronously so we can support InfluxDB more correctly, and improve Carbonara (the library we created to manipulate timeseries) based drivers to be faster.

All of those should – plus a few Oslo tasks I'd like to tackle – should keep me busy for the next cycle!

 May 25, 2015

Last week during the the OpenStack Summit in Vancouver, Intel organized a Rule the Stack contest. That's the third one, after Atlanta a year ago and Paris six months ago. In case you missed earlier episodes, SUSE won the two previous contests with Dirk being pretty fast in Atlanta and Adam completing the HA challenge so we could keep the crown. So of course, we had to try again!

For this contest, the rules came with a list of penalties and bonuses which made it easier for people to participate. And indeed, there were quite a number of participants with the schedule for booking slots being nearly full. While deploying Kilo was a goal, you could go with older releases getting a 10 minutes penalty per release (so +10 minutes for Juno, +20 minutes for Icehouse, and so on). In a similar way, the organizers wanted to see some upgrade and encouraged that with a bonus that could significantly impact the results (-40 minutes) — nobody tried that, though.

And guess what? SUSE kept the crown again. But we also went ahead with a new challenge: outperforming everyone else not just once, but twice, with two totally different methods.

For the super-fast approach, Dirk built again an appliance that has everything pre-installed and that configures the software on boot. This is actually not too difficult thanks to the amazing Kiwi tool and all the knowledge we have accumulated through the years at SUSE about building appliances, and also the small scripts we use for the CI of our OpenStack packages. Still, it required some work to adapt the setup to the contest and also to make sure that our Kilo packages (that were brand new and without much testing) were fully working. The clock result was 9 minutes and 6 seconds, resulting in a negative time of minus 10 minutes and 54 seconds (yes, the text in the picture is wrong) after the bonuses. Pretty impressive.

But we also wanted to show that our product would fare well, so Adam and I started looking at this. We knew it couldn't be faster than the way Dirk picked, and from the start, we targetted the second position. For this approach, there was not much to do since this was similar to what he did in Paris, and there was work to update our SUSE OpenStack Cloud Admin appliance recently. Our first attempt failed miserably due to a nasty bug (which was actually caused by some unicode character in the ID of the USB stick we were using to install the OS... we fixed that bug later in the night). The second attempt went smoother and was actually much faster than we had anticipated: SUSE OpenStack Cloud deployed everything in 23 minutes and 17 seconds, which resulted in a final time of 10 minutes and 17 seconds after bonuses/penalties. And this was with a 10 minutes penalty due to the use of Juno (as well as a couple of minutes lost debugging some setup issue that was just mispreparation on our side). A key contributor to this result is our use of Crowbar, which we've kept improving over time, and that really makes it easy and fast to deploy OpenStack.

Wall-clock time for SUSE OpenStack Cloud

These two results wouldn't have been possible without the help of Tom and Ralf, but also without the whole SUSE OpenStack Cloud team that works on a daily basis on our product to improve it and to adapt it to the needs of our customers. We really have an awesome team (and btw, we're hiring)!

For reference, three other contestants succeeded in deploying OpenStack, with the fastest of them ending at 58 minutes after bonuses/penalties. And as I mentioned earlier, there were even more contestants (including some who are not vendors of an OpenStack distribution), which is really good to see. I hope we'll see even more in Tokyo!

Results of the Rule the Stack contest

Also thanks to Intel for organizing this; I'm sure every contestant had fun and there was quite a good mood in the area reserved for the contest.

# Overview

Pictures are the right way to start.

Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

## The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs. Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it’s easiest to think of the Aliasing PPGTT as a set page table of page tables that is maintained to have the identical mappings as the global GTT (the picture above). There is more on this in the Summary section

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I’ll also assume you’ve read, or understand part 1.

## Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straight forward. The choice is provided in several GPU commands. If there is no explicit choice, than there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or a Aliasing PPGTT1. Several commands as well, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.

# Architecture

The names for all the page table data structures in hardware are the same as for IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don’t want to rehash the HSW PRMs  too much, and I am probably not allowed to won’t copy the diagrams. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT – the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table.  There are several good diagrams which I won’t bother redrawing in the PRMs2

31:12, 11:04, 03:02, 01, 0
Physical Page Address 31:12, Physical Page Address 39:32, Rsvd, Page size (4K/32K), Valid
[/easytable]

31:12, 11, 10:04, 03:01, 0
Physical Page Address 31:12, Cacheability Control[3], Physical Page Address 38:32, Cacheability Control[2:0], Valid
[/easytable]

There’s some things we can get from this for those too lazy to click on the links to the docs.

1. PPGTT page tables exist in physical memory.
2. PPGTT PTEs have the exact same layout as GGTT PTEs.
3. PDEs don’t have cache attributes (more on this later).
4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I’ve not yet defined the size).

• A PDE is 4 bytes wide
• A PTE is 4 bytes wide
• A Page table occupies 4k of memory.
• There are 4k/4 entries in a page table.

With all this information, I now present you a slightly more accurate picture.

An object with an aliased PPGTT mapping

## Size

PP_DCLV – PPGTT Directory Cacheline Valid Register: As the spec tells us, “This register controls update of the on-chip PPGTT Directory Cache during a context restore.” This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

63:32,31:0

MBZ, “PPGTT Directory Cache Restore [1..32] 16 entries”

[/easytable]

31, 30, …, 1, 0

PDE[511:496] enable, PDE [495:480] enable,…,PDE[31:16] enable, PDE[15:0] enable

[/easytable]

The, “why” is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes, there are 64b in a cacheline, so 64/4 = 16 entries per bit.  We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB

## Location

PP_DIR_BASE: Sadly, I cannot find the definition to this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs yay me. There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, “This register contains the offset into the GGTT where the (current context’s) PPGTT page directory begins.” We learn a very important caveat about the PPGTT here – the PPGTT PDEs reside within the GGTT.

## Programming

With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It’s actually pretty ho-hum.

...
ret = intel_ring_begin(ring, 6);
if (ret)
return ret;

intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);
...


As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU’s LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about the commands execution in the GPU’s ringbuffer. If it’s easier just pretend it’s 2 MMIO writes.

### Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven’t changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&amp;dev_priv-&gt;gtt.base.mm,
&amp;ppgtt-&gt;node, GEN6_PD_SIZE,
GEN6_PD_ALIGN, 0,
0, dev_priv-&gt;gtt.base.total,
DRM_MM_TOPDOWN);


2. Allocate the page tables

for (i = 0; i &lt; ppgtt-&gt;num_pd_entries; i++) {
ppgtt-&gt;pt_pages[i] = alloc_page(GFP_KERNEL);
if (!ppgtt-&gt;pt_pages[i]) {
gen6_ppgtt_free(ppgtt);
return -ENOMEM;
}
}


3. [possibly] IOMMU map the pages

for (i = 0; i &lt; ppgtt-&gt;num_pd_entries; i++) {

pt_addr = pci_map_page(dev-&gt;pdev, ppgtt-&gt;pt_pages[i], 0, 4096,
PCI_DMA_BIDIRECTIONAL);
...
}


As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.

### IOMMU

As we saw in step 3 above, I mention that the page tables may be mapped by the IOMMU. This is one important caveat that I didn’t fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware’s initialization. This means that as long as Linux treats that memory as special, everything will just work (just don’t look for IOMMU implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don’t want to write the physical address to the PDEs, we want to write the dma address. Deferring to wikipedia again for the description of an IOMMU., that’s all.It tripped be up the first time I saw it because I hadn’t dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity – I’d probably need another diagram to accommodate.

## Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register.  Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
ecochk |= ECOCHK_PPGTT_WB_HSW;
} else {
ecochk |= ECOCHK_PPGTT_LLC_IVB;
ecochk &amp;= ~ECOCHK_PPGTT_GFDT_IVB;
}
I915_WRITE(GAM_ECOCHK, ecochk);


What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I’ve chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram – you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.

## Distinctions from the GGTT

At this point I hope you’re asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable (or it is undesirable) of using a per process address space.

A brief description of why, with all the current callers of the global pin interface.

• Display: Display actually implements it’s own version of the GGTT. Maintaining the logic to support multiple level page tables was both costly, and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. Ie xpect this to be true, forever.
• i915_gem_object_pin_to_display_plane(): page flipping
• intel_setup_overlay(): overlays
• Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn’t make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware’s execution logic – which would be a nightmare to synchronize if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
• allocate_ring_buffer()
• HW Contexts: Extremely similar to ringbuffer.
• intel_alloc_context_page(): Ironlake RC6
• i915_gem_create_context(): Create the default HW context
• i915_gem_context_reset(): Re-pin the default HW context
• do_switch(): Pin the logical context we’re switching to
• Hardware status page: The use of this, prior to execlists, is much like rinbuffers, and contexts. There is a per process status page with execlists.
• init_status_page()
• Workarounds:
• init_pipe_control(): Initialize scratch space for workarounds.
• intel_init_render_ring_buffer(): An i830 w/a I won’t bother to understand
• render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
• Other
• i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
• i915_gem_pin_ioctl(): The DRI1 pin interface.

# GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

• PTE size increased to 8b
• Therefore, 512 entries per table
• Format mimics the CPU PTEs
• PDEs increased to 8b (remains 512 PDEs per PD)
• Page Directories live in system memory
• GGTT no longer holds the PDEs.
• There are 4 PDPs, and therefore 4 PDs
• PDEs are cached in LLC instead of special cache (I’m guessing)
• New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
• PP_DIR_BASE, and PP_DCLV are removed
• Support for 4 level page tables, up to 48b virtual address space.
• PML4[PML4E]->PDP
• PDP[PDPE] -> PD
• PD[PDE] -> PT
• PT{PTE] -> Memory
• Big pages are now 64k instead of 32k (still not implemented)
• New caching interface via PAT like structure

# Summary

There’s actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about every thing mapped into the GGTT shouldn’t be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT are disjoint5). The patches to make this work are not yet merged. I’d put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

• The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
• Aliasing PPGTT was designed as a drop in performance replacement to the GGTT.
• GEN8 changed a lot of architectural stuff.
• The Aliasing PPGTT shouldn’t actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I’ve created. Use them as you like.

1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT

2. Big pages have the same goal as they do on the CPU – to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn’t a big win to be had for many workloads we care about, and so this remains a low priority.

3. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches

4. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry(

# Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you’re going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I’d write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

1. CODE: The code is not not merged, not reviewed, and not tested (by anyone but me). There’s no indication about the “upstreamability”. What this means is that if you read my blog to understand how the i915 driver currently works, you’ll be taking a crap-shoot on this one.
2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can’t refer to them directly, I can only refer to the code.
3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn’t possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it’s still fresh (for some very warped definition of “fresh”).

# Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by itself. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). TYou would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though and should anything from the series get merged, and I believe the result is a huge improvement in code readability.

## Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn’t nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature Status TODO
Dynamic page tables Implemented Test and fix bugs
64b Address space Implemented Test and fix bugs
GPU mirroring Proof of Concept Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don’t have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

## Present: Relocations

Throughout many of my previous blog posts I’ve gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It’s impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

1. Create BOx
2. Create BOy
3. Request BOx be uncached via (IOCTL DRM_IOCTL_I915_GEM_SET_CACHING).
4. Do one of aforementioned operations on BOx and BOy
5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset
is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace:

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

## Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal. Symmetric mappings to a BO on both the GPU and the CPU. There are benefits for ditching relocations. One of the nice side effects of getting rid of relocations is it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it’s still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and
kernels executing on devices to directly share complex, pointer-containing data structures such
as trees and linked lists. It also eliminates the need to marshal data between the host and devices.
As a result, SVM substantially simplifies OpenCL programming and may improve performance."


# Makin’ it Happen

## 64b

As I’ve already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you’re probably left with a big ‘WTF’ face. I probably can’t help you if you’re in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you’d expect them to. Instead of the x86 CPU’s CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.

## “This will take one week. I can just allocate everything up front.” (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. “One week. I can just allocate everything up front.” If what I have is, “done” then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 5122 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops


### Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level pages tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can’t use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
}

static inline int pte_present(pte_t a)
{
return pte_flags(a) &amp; (_PAGE_PRESENT | _PAGE_PROTNONE |
_PAGE_NUMA);
}
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
}

static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current-&gt;mm;

printk(&quot;Page is present: %s\n&quot;, pte_present(*ptep) ? &quot;yes&quot; : &quot;no&quot;);


X86 page table code has a two very distinct property that does not exist here (warning, this is slightly hand-wavy).

1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don’t know where our pages tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don’t want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about of week of making the IOMMU driver do things it shouldn’t do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

### Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It’s difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn’t a good solution, but at that time I just wanted to get the thing working and didn’t really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won’t work. For the following, think of the four level page tables as arrays. ie.

• PML4[0-255], each point to a PDP
• PDP[0-255][0-511], each point to a PD
• PD[0-255][0-511][0-511], each point to a PT
• PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
1. [DRM]Find that address 0 is free.
2. [i915]Allocate PDP for PML4[0]
3. [i915]Allocate PD for PDP[0][0]
4. [i915]Allocate PT for PD[0][0][0]/li>
5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO’s backing page.
3. [i915] Dispatch work to the GPU on behalf of mesa.
4. [i915] Observe the hardware has completed
5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
1. [DRM]Find that address 0x200000 is free.
2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
4. Abort.

### Page Tables Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries. It is possible to achieve this same thing by using a sentinel value (point the page table entry to the scratch page). To implement this involves reading back potentially large amounts of data from the page tables which will be slow. It should work though. I didn’t try it.

After I had determined I couldn’t reuse x86 code, and that I need some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went – none of the upsides of a hash table are useful here, but all of the downsides are present(space). Bitmaps was sort of the default case. Unfortunately though, I did some math at this point, notice the LaTex!.
$\frac{2^{47}bytes}{\frac{4096bytes}{1 page}} = 34359738368 pages \\ 34359738368 pages \times \frac{1bit}{1page} = 34359738368 bits \\ 34359738368 bits \times \frac{8bits}{1byte} = 4294967296 bytes$
That’s 4GB simply to track every page. There’s some more overhead because page [tables, directories, directory pointers] are also tracked.
$256entries + (256\times512)entries + (256\times512^2)entries = 67240192entries \\ 67240192entries \times \frac{1bit}{1entry} = 67240192bits \\ 67240192bits \times \frac{8bits}{1byte} = 8405024bytes \\ 4294967296bytes + 8405024bytes = 4303372320bytes \\ 4303372320bytes \times \frac{1GB}{1073741824bytes} = 4.0078G$

I can’t remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn’t see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you’re using 128TB of memory is inconsequential. That is 0.3% of the memory used by the GPU. Hopefully you didn’t fall into that trap, and I just wasted your time, but there it is anyway.

#### Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would “walk” to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
...
}

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
...
}

/**
* alloc_page_tables - Allocate all page tables for the given virtual address.
*
* This will allocate all the necessary page tables to map exactly one page at
* @address. The page tables will not be connected, and the PTE will not point
* to a page.
*
* @ppgtt:	The PPGTT structure encapsulating the virtual address space.
*
*/
static void
{
struct i915_pagetab *pt;
struct i915_pagedir *pd;
struct i915_pagedirpo *pdp;
struct i915_pml4 *pml4 = &amp;ppgtt-&gt;pml4; /* Always there */

int pte = (address &amp; I915_PDES_PER_PD);

if (!test_bit(pml4e, pml4-&gt;used_pml4es))
goto alloc;

pdp = pml4-&gt;pagedirpo[pml4e];
if (!test_bit(pdpe, pdp-&gt;used_pdpes;))
goto alloc;

pd = pdp-&gt;pagedirs[pdpe];
if (!test_bit(pde, pd-&gt;used_pdes)
goto alloc;

pt = pd-&gt;page_tables[pde];
if (test_bit(pte, pt-&gt;used_ptes))
return;

alloc_pdp:
pdp = alloc_one_pdp(pml4, pml4e);
set_bit(pml4e, pml4-&gt;used_pml4es);
alloc_pd:
pd = alloc_one_pd(pdp, pdpe);
set_bit(pdpe, pdp-&gt;used_pdpes);
alloc_pt:
pt = alloc_one_pt(pd, pde);
set_bit(pde, pd-&gt;used_pdes);
}


Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

# The GPU mirroring interface

I really don’t want to spend too much time here. In other words, no more pictures. As I’ve already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I’ve submitted, 2 changes were made to the existing userptr interface (which wasn’t then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
__u64 user_ptr;
__u64 user_size;
__u32 ctx_id;
__u32 flags;
#define I915_USERPTR_GPU_MIRROR         (1&lt;&lt;1)
#define I915_USERPTR_UNSYNCHRONIZED     (1&lt;&lt;31)
/**
* Returned handle for the object.
*
* Object handles are nonzero.
*/
__u32 handle;
};


The context argument is to tell the i915 driver for which address space we’ll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

• Pros:
• This interface is very simple.
• Existing userptr code does the hard work for us
• Cons:
• You need 1 IOCTL per object. Much undeeded overhead.
• It’s subject to a lot of problems userptr has5
• Userptr was already merged, so unless pad get’s repruposed, we’re screwed

## What should be: soft pin

There hasn’t been too much discussion here, so it’s hard to say. I believe the trends of the discussion (and the author’s personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called, “soft pin.” It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can’t merged until there is an open source userspace. Stay tuned. Perhaps I’ll update the blog as the story unfolds.

# Wrapping it up (all 4 parts)

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, “buy me a beer if you liked this”, I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

1. The patches I posted for enabling GPU mirroring piggyback of of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)

3. The PDP registers are are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

4. I am not sure how KVM works manages page tables. At least conceptually I’d think they’d have a similar problem to the i915 driver’s page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn’t have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

5. Let me be clear that I don’t think userptr is a bad thing. It’s a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring

 May 22, 2015
Modern (and some less modern) laptops and tablets have a lot of builtin sensors: accelerometer for screen positioning, ambient light sensors to adjust the screen brightness, compass for navigation, proximity sensors to turn off the screen when next to your ear, etc.

Enabling

We've supported accelerometers in GNOME/Linux for a number of years, following work on the WeTab. The accelerometer appeared as an input device, and sent kernel events when the orientation of the screen changed.

Recent devices, especially Windows 8 compatible devices, instead export a HID device, which, under Linux, is handled through the IIO subsystem. So the first version of iio-sensor-proxy took readings from the IIO sub-system and emulated the WeTab's accelerometer: a few too many levels of indirection.

The 1.0 version of the daemon implements a D-Bus interface, which means we can support more than accelerometers. The D-Bus API, this time, is modelled after the Android and iOS APIs.

Enjoying

Accelerometers will work in GNOME 3.18 as well as it used to, once a few bugs have been merged[1]. If you need support for older versions of GNOME, you can try using version 0.1 of the proxy.

Orientation lock in action

As we've adding ambient light sensor support in the 1.0 release, time to put in practice best practice mentioned by Owen's post about battery usage. We already had code like that in gnome-power-manager nearly 10 years ago, but it really didn't work very well.

The major problem at the time was that ambient light sensor reading weren't in any particular unit (values had different meanings for different vendors) and the user felt that they were fighting against the computer for the control of the backlight.

Richard fixed that though, adapting work he did on the ColorHug ALS sensor, and the brightness is now completely in the user's control, and adapts to the user's tastes. This means that we can implement the simplest of UIs for its configuration.

Power saving in action

This will be available in the upcoming GNOME 3.17.2 development release.

For future versions, we'll want to export the raw accelerometer readings, so that applications, including games, can make use of them, which might bring up security issues. SDL, Firefox, WebKit could all do with being adapted, in the near future.

We're also looking at adding compass support (thanks Elad!), which Geoclue will then export to applications, so that location and heading data is collected through a single API.

Richard and Benjamin Tissoires, of fixing input devices fame, are currently working on making the ColorHug-ALS compatible with Windows 8, meaning it would work out of the box with iio-sensor-proxy.

We're currently using GitHub for bug and code tracking. Releases are mirrored on freedesktop.org, as GitHub is known to mangle filenames. API documentation is available on developer.gnome.org.

[1]: gnome-settings-daemon, gnome-shell, and systemd will need patches
 May 20, 2015

In the last weeks I have been working together with my colleague Samuel on bringing support for ARB_shader_storage_buffer_object, an OpenGL 4.3 feature, to Mesa and the Intel i965 driver, so I figured I would write a bit on what this brings to OpenGL/GLSL users. If you are interested, read on.

This extension introduces the concept of shader storage buffer objects (SSBOs), which is a new type of OpenGL buffer. SSBOs allow GL clients to create buffers that shaders can then map to variables (known as buffer variables) via interface blocks. If you are familiar with Uniform Buffer Objects (UBOs), SSBOs are pretty similar but:

• They allow a number of atomic operations on them.
• They allow an optional unsized array at the bottom of their definitions.

Since SSBOs are read/write, they create a bidirectional channel of communication between the GPU and CPU spaces: the GL application can set the value of shader variables by writing to a regular OpenGL buffer, but the shader can also update the values stored in that buffer by assigning values to them in the shader code, making the changes visible to the GL application. This is a major difference with UBOs.

In a parallel environment such as a GPU where we can have multiple shader instances running simultaneously (processing multiple vertices or fragments from a specific rendering call) we should be careful when we use SSBOs. Since all these instances will be simultaneously accessing the same buffer there are implications to consider relative to the order of reads and writes. The spec does not make many guarantees about the order in which these take place, other than ensuring that the order of reads and writes within a specific execution of a shader is preserved. Thus, it is up to the graphics developer to ensure, for example, that each execution of a fragment or vertex shader writes to a different offset into the underlying buffer, or that writes to the same offset always write the same value. Otherwise the results would be undefined, since they would depend on the order in which writes and reads from different instances happen in a particular execution.

The spec also allows to use glMemoryBarrier() from shader code and glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) from a GL application to add sync points. These ensure that all memory accesses to buffer variables issued before the barrier are completely executed before moving on.

Another tool for developers to deal with concurrent accesses is atomic operations. The spec introduces a number of new atomic memory functions for use with buffer variables: atomicAdd, atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor, atomicExchange (atomic assignment to a buffer variable), atomicCompSwap (atomic conditional assignment to a buffer variable).

The optional unsized array at the bottom of an SSBO definition can be used to push a dynamic number of entries to the underlying buffer storage, up to the total size of the buffer allocated by the GL application.

Using shader storage buffer objects (GLSL)

Okay, so how do we use SSBOs? We will introduce this through an example: we will use a buffer to record information about the fragments processed by the fragment shader. Specifically, we will group fragments according to their X coordinate (by computing an index from the coordinate using a modulo operation). We will then record how many fragments are assigned to a particular index, the first fragment to be assigned to a given index, the last fragment assigned to a given index, the total number of fragments processed and the complete list of fragments processed.

To store all this information we will use the SSBO definition below:

layout(std140, binding=0) buffer SSBOBlock {
vec4 first[8];     // first fragment coordinates assigned to index
vec4 last[8];      // last fragment coordinates assigned to index
int counter[8];    // number of fragments assigned to index
int total;         // number of fragments processed
vec4 fragments[];  // coordinates of all fragments processed
};


Notice the use of the keyword buffer to tell the compiler that this is a shader storage buffer object. Also notice that we have included an unsized array called fragments[], there can only be one of these in an SSBO definition, and in case there is one, it has to be the last field defined.

In this case we are using std140 layout mode, which imposes certain alignment rules for the buffer variables within the SSBO, like in the case of UBOs. These alignment rules may help the driver implement read/write operations more efficiently since the underlying GPU hardware can usually read and write faster from and to aligned addresses. The downside of std140 is that because of these alignment rules we also waste some memory and we need to know the alignment rules on the GL side if we want to read/write from/to the buffer. Specifically for SSBOs, the specification introduces a new layout mode: std430, which removes these alignment restrictions, allowing for a more efficient memory usage implementation, but possibly at the expense of some performance impact.

The binding keyword, just like in the case of UBOs, is used to select the buffer that we will be reading from and writing to when accessing these variables from the shader code. It is the application’s responsibility to bound the right buffer to the binding point we specify in the shader code.

So with that done, the shader can read from and write to these variables as we see fit, but we should be aware of the fact that multiple instances of the shader could be reading from and writing to them simultaneously. Let’s look at the fragment shader that stores the information we want into the SSBO:

void main() {
int index = int(mod(gl_FragCoord.x, 8));

if (i == 0)
first[index] = gl_FragCoord;
else
last[index] = gl_FragCoord;

fragments[i] = gl_FragCoord;
}


The first line computes an index into our integer array buffer variable by using gl_FragCoord. Notice that different fragments could get the same index. Next we increase in one unit counter[index]. Since we know that different fragments can get to do this at the same time we use an atomic operation to make sure that we don’t lose any increments.

If this is not the first fragment assigned to that index, we write to last[index] instead. Again, multiple fragments assigned to the same index could do this simultaneously, but that is okay here, because we only care about the the last write. Also notice that it is possible that different executions of the same rendering command produce different values of first[] and last[].

The remaining instructions unconditionally push the fragment coordinates to the unsized array. We keep the last index into the unsized array fragments[] we have written to in the buffer variable total. Each fragment will atomically increase total before writing to the unsized array. Notice that, once again, we have to be careful when reading the value of total to make sure that each fragment reads a different value and we never have two fragments write to the same entry.

Using shader storage buffer objects (GL)

On the side of the GL application, we need to create the buffer, bind it to the appropriate binding point and initialize it. We do this as usual, only that we use the new GL_SHADER_STORAGE_BUFFER target:

typedef struct {
float first[8*4];      // vec4[8]
float last[8*4];       // vec4[8]
int counter[8*4];      // int[8] padded as per std140
int total;             // int
char fragments[1024];  // up to 1024 bytes of unsized array
} SSBO;

SSBO data;

(...)

memset(&data, 0, sizeof(SSBO));

GLuint buf;
glGenBuffers(1, &buf);


The code creates a buffer, binds it to binding point 0 of GL_SHADER_STORAGE_BUFFER (the same we have bound our shader to) and initializes the buffer data to 0. Notice that because we are using std140 we have to be aware of the alignment rules at work. We could have used std430 instead to avoid this.

Since we have 1024 bytes for the fragments[] unsized array and we are pushing a vec4 (16 bytes) worth of data to it with every fragment we process then we have enough room for 64 fragments. It is the developer’s responsibility to ensure that this limit is not surpassed, otherwise we would write beyond the allocated space for our buffer and the results would be undefined.

The next step is to do some rendering so we get our shaders to work. That would trigger the execution of our fragment shader for each fragment produced, which will generate writes into our buffer for each buffer variable the shader code writes to. After rendering, we can map the buffer and read its contents from the GL application as usual:

SSBO *ptr = (SSBO *) glMapNamedBuffer(buf, GL_READ_ONLY);

/* List of fragments recorded in the unsized array */
printf("%d fragments recorded:\n", ptr->total);
float *coords = (float *) ptr->fragments;
for (int i = 0; i < ptr->total; i++, coords +=4) {
printf("Fragment %d: (%.1f, %.1f, %.1f, %.1f)\n",
i, coords[0], coords[1], coords[2], coords[3]);
}

/* First fragment for each index used */
for (int i = 0; i < 8; i++) {
if (ptr->counter[i*4] > 0)
printf("First fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
i, ptr->first[i*4], ptr->first[i*4+1],
ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Last fragment for each index used */
for (int i = 0; i < 8; i++) {
if (ptr->counter[i*4] > 1)
printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
i, ptr->last[i*4], ptr->last[i*4+1],
ptr->last[i*4+2], ptr->last[i*4+3]);
else if (ptr->counter[i*4] == 1)
printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
i, ptr->first[i*4], ptr->first[i*4+1],
ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Fragment counts for each index */
for (int i = 0; i < 8; i++) {
if (ptr->counter[i*4] > 0)
printf("Fragment count at index %d: %d\n", i, ptr->counter[i*4]);
}
glUnmapNamedBuffer(buf);


I get this result for an execution where I am drawing a handful of points:

4 fragments recorded:
Fragment 0: (199.5, 150.5, 0.5, 1.0)
Fragment 1: (39.5, 150.5, 0.5, 1.0)
Fragment 2: (79.5, 150.5, 0.5, 1.0)
Fragment 3: (139.5, 150.5, 0.5, 1.0)

First fragment for index 3: (139.5, 150.5, 0.5, 1.0)
First fragment for index 7: (39.5, 150.5, 0.5, 1.0)

Last fragment for index 3: (139.5, 150.5, 0.5, 1.0)
Last fragment for index 7: (79.5, 150.5, 0.5, 1.0)

Fragment count at index 3: 1
Fragment count at index 7: 3


It recorded 4 fragments that the shader mapped to indices 3 and 7. Multiple fragments where assigned to index 7 but we could handle that gracefully by using the corresponding atomic functions. Different executions of the same program will produce the same 4 fragments and map them to the same indices, but the first and last fragments recorded for index 7 can change between executions.

Also notice that the first fragment we recorded in the unsized array (fragments[0]) is not the first fragment recorded for index 7 (fragments[1]). That means that the execution of fragments[0] got first to the unsized array addition code, but the execution of fragments[1] beat it in the race to execute the code that handled the assignment to the first/last arrays, making clear that we cannot make any assumptions regarding the execution order of reads and writes coming from different instances of the same shader execution.

So that’s it, the patches are now in the mesa-dev mailing list undergoing review and will hopefully land soon, so look forward to it! Also, if you have any interesting uses for this new feature, let me know in the comments.

 May 12, 2015

A couple of months ago, I was meeting colleagues of mine working on Docker and discussing about how much effort it would be to add support for it to SUSE OpenStack Cloud. It's been something that had been requested for a long time by quite a number of people and we never really had time to look into it. To find out how difficult it would be, I started looking at it on the evening; the README confirmed it shouldn't be too hard. But of course, we use Crowbar as our deployment framework, and the manual way of setting it up is not really something we'd want to recommend. Now would it be "not too hard" or just "easy"? There was only way to know that... And guess what happened next?

It took a couple of hours (and two patches) to get this working, including the time for packaging the missing dependencies and for testing. That's one of the nice things we benefit from using Crowbar: adding new features like this is relatively straight-forward, and so we can enable people to deploy a full cloud with all of these nice small features, without requiring them to learn about all the technologies and how to deploy them. Of course this was just a first pass (using the Juno code, btw).

Fast-forward a bit, and we decided to integrate this work. Since it was not a simple proof of concept anymore, we went ahead with some more serious testing. This resulted in us backporting patches for the Juno branch, but also making Nova behave a bit better since it wasn't aware of Docker as an hypervisor. This last point is a major problem if people want to use Docker as well as KVM, Xen, VMware or Hyper-V — the multi-hypervisor support is something that really matters to us, and this issue was actually the first one that got reported to us ;-) To validate all our work, we of course asked tempest to help us and the results are pretty good (we still have some failures, but they're related to missing features like volume support).

All in all, the integration went really smoothly :-)

Oh, I forgot to mention: there's also a docker plugin for heat. It's now available with our heat packages now in the Build Service as openstack-heat-plugin-heat_docker (Kilo, Juno); I haven't played with it yet, but this post should be a good start for anyone who's curious about this plugin.

 May 11, 2015

I've recently been contacted by Johannes Hubertz, who is writing a new book about Python in German called "Softwaretests mit Python" which will be published by Open Source Press, Munich this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview that I gave for his book. Johannes translated to German and it will be included in Johannes' book, and I decided to publish it on my blog today. Following is the original version.

## How did you come to Python?

I don't recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn't really like Perl and was not getting a good grip on its object system.

As soon as I found an idea to work on – if I remember correctly that was rebuildd – I started to code in Python, learning the language at the same time.

I liked how Python worked, and how fast I was to able to develop and learn it, so I decided to keep using it for my next projects. I ended up diving into Python core for some reasons, even doing things like briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack.

OpenStack is a cloud computing platform entirely written in Python. So I've been writing Python every day since working on it.

That's what pushed me to write The Hacker's Guide to Python in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python.

It had a great success, has even been translated in Chinese and Korean, so I'm currently working on a second edition of the book. It has been an amazing adventure!

## Zen of Python: Which line is the most important for you and why?

I like the "There should be one – and preferably only one – obvious way to do it". The opposite is probably something that scared me in languages like Perl. But having one obvious way to do it is and something I tend to like in functional languages like Lisp, which are in my humble opinion, even better at that.

## For a python newbie, what are the most difficult subjects in Python?

I haven't been a newbie since a while, so it's hard for me to say. I don't think the language is hard to learn. There are some subtlety in the language itself when you deeply dive into the internals, but for beginners most of the concept are pretty straight-forward. If I had to pick, in the language basics, the most difficult thing would be around the generator objects (yield).

Nowadays I think the most difficult subject for new comers is what version of Python to use, which libraries to rely on, and how to package and distribute projects. Though things get better, fortunately.

## When did you start using Test Driven Development and why?

I learned unit testing and TDD at school where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was losing my time. Which I actually was, since I was writing disposable programs – that's the only thing you do at school.

Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs… I already fixed. That recalled me about unit tests and that it may be a good idea to start using them to stop fixing the same things over and over again.

For a few years, I wrote less Python and more C code and Lua (for the awesome window manager), and I didn't use any testing. I probably lost hundreds of hours testing manually and fixing regressions – that was a good lesson. Though I had good excuses at that time – it is/was way harder to do testing in C/Lua than in Python.

Since that period, I have never stopped writing "tests". When I started to hack on OpenStack, the project was adopting a "no test? no merge!" policy due to the high number of regressions it had during the first releases.

I honestly don't think I could work on any project that does not have – at least a minimal – test coverage. It's impossible to hack efficiently on a code base that you're not able to test in just a simple command. It's also a real problem for new comers in the open source world. When there are no test, you can hack something and send a patch, and get a "you broke this" in response. Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn't break anything!

In the end, it's just too much frustration to work on non tested projects as I demonstrated in my study of whisper source code.

## What do you think to be the most often seen pitfalls of TDD and how to avoid them best?

The biggest problems are when and at what rate writing tests.

On one hand, some people starts to write too precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean that you should not do test at all, but you should probably start with a light coverage, until you are pretty sure that you're not going to rip every thing and start over. On the other hand, some people postpone writing tests for ever, and end up with no test all or a too thin layer of test. Which makes the project with a pretty low coverage.

Basically, your test coverage should reflect the state of your project. If it's just starting, you should build a thin layer of test so you can hack it on it easily and remodel it if needed. The more your project grow, the more you should make it sold and lay more tests.

Having too detailed tests is painful to make the project evolve at the start. Having not enough in a big project makes it painful to maintain it.

## Do you think, TDD fits and scales well for the big projects like OpenStack?

Not only I think it fits and scales well, but I also think it's just impossible to not use TDD in such big projects.

When unit and functional tests coverage was weak in OpenStack – at its beginning – it was just impossible to fix a bug or write a new feature without breaking a lot of things without even noticing. We would release version N, and a ton of old bugs present in N-2 – but fixed in N-1 – were reopened.

For big projects, with a lot of different use cases, configuration options, etc, you need belt and braces. You cannot throw code in a repository thinking it's going to work ever, and you can't afford to test everything manually at each commit. That's just insane.

 May 08, 2015

I’ve had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it’s clear that won’t happen now. Words are better than nothing, I suppose…

Recently I pushed an intel-gpu-tool tool for modifying GPU frequencies. It’s meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.

# Introduction

Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I’d not understand much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we’re asking the firmware for a frequency. It can, and does overrule, balk and ignore our frequency requests

# A brief explanation on frequencies in i915

The term that is used within the kernel driver and docs is, “RPS” which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster, higher number are slower. Conveniently, we only have two numbers, 0, and 1 on GEN. Past that, I don’t know how CPU P-States work, so I’m not sure how much else is similar.

There are roughly 4 generations of RPS:

1. IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can’t speak much about it. It stopped being a thing after Ironlake (Gen5). I don’t care to look, but you can if you want: drivers/platform/x86/intel_ips.c
2. RPS (Sandybridge, Ivybridge)
There are 4 numbers of interest: RP0, RP1, RPn, “hw max”. The first 3 are directly read from a register

rp_state_cap = I915_READ(GEN6_RP_STATE_CAP);
dev_priv->rps.rp0_freq          = (rp_state_cap >>  0) & 0xff;
dev_priv->rps.rp1_freq          = (rp_state_cap >>  8) & 0xff;
dev_priv->rps.min_freq          = (rp_state_cap >> 16) & 0xff;


RP0 is the maximum value the driver can request from the firmware. It’s the highest non-overclocked frequency supported.
RPn is the minimum value the driver can request from the firmware. It’s the lowest frequency supported.
RP1 is the most efficient frequency.
hw_max is RP0 if the system does not supports overclocking.
Otherwise, it is read through a special set of commands where we’re told by the firmware the real max. The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.

3. RPS (HSW+)
Similar to the previous RPS, except there is an RPe (for efficient). I don’t know how it differs from RP1. I just learned about this after posting the tool – but just be aware it’s different. Baytrail and Cherryview also have a concept of RPe.
4. The Atom based stuff (Baytrail, Cherryview)
I don’t pretend to know how this works [or doesn’t]. They use similar terms.
They have the same set of names as the HSW+ RPS.

# How the driver uses it

The driver can make requests to the firmware for a desired frequency:

if (IS_HASWELL(dev) || IS_BROADWELL(dev))
I915_WRITE(GEN6_RPNSWREQ, HSW_FREQUENCY(val));
else
I915_WRITE(GEN6_RPNSWREQ, GEN6_FREQUENCY(val) |
GEN6_OFFSET(0) |
GEN6_AGGRESSIVE_TURBO);


This register interface doesn’t provide a way to determine the request was granted other than reading back the current frequency, which is error prone, as explained below.

By default, the driver will request frequencies between the efficient frequency (RP1, or RPe), and the max frequency (RP0 or hwmax) based on the system’s busyness. The busyness can be calculated either by software, or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that:

render_count = I915_READ(VLV_RENDER_C0_COUNT_REG);


In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or under worked. At each interrupt the driver would raise or lower by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was disable interrupts telling us to go up when we were already at max, and the equivalent for down.

It seems obvious that there are usual trends, if you’re incrementing the frequency you’re more likely to increment again in the near future and such. Since leaving for sabbatical and returning to work on mesa, there has been a lot of complexity added here by Chris Wilson, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can’t talk knowledgeably about it – just realize it’s not as naive as it once was, and should do a better job as a result.

The flow seems a bit ping-pongish:

The benefit is though that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there’s nothing going on in the system, that should provide significant power savings.

# Why you shouldn’t set –max or –min

First, what does it do? –max and –min lock the GPU frequency to min and max respectively. What this actually means is that in the driver, even if we get interrupts to throttle up, or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn’t check what this does when we’re using software to determine busyness, but it should be very similar

I should mention now that this is what I’ve inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it’s previously acknowledges it can give a specific frequency.

As an example if you do:

intel_gpu_frequency --set X
assert (intel_gpu_frequency --get == X)


There is a period in the middle, and after the assert where the firmware may have done who knows what. Furthermore, if we try to lock to –max, the GPU is more likely to hit these overheating conditions and throttle you down. So –max doesn’t even necessarily get you max. It’s sort of an interesting implication there since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway, and we’d get into the same spot, but I’m not really sure. Perhaps the firmware won’t tell us to throttle up so aggressively when it’s near its thermal limits. Using –max can actually result in non-optimal performance, and I have some good theories why, but I’ll keep them to myself since I really have no proof.

–min on the other hand is equally stupid for a different and more obvious reason. As I said above, it’s guaranteed to provide the worst possible performance and not guaranteed to provide the most optimal power usage.

# What you should do with it

The usages are primarily benchmarking and performance debug. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max will be the most optimal.

min is useful to get a measure of what the worst possible performance is to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don’t actually expect anyone to use this very much.

## How not to get shot in the foot

If you are a developer trying to get performance optimizations or measurements which you think can be negatively impacted by the GPU throttling you can set this value. Again because of the thermal considerations, as you go from run torun making tweaks, I’d recommend setting something maybe 75% or so of max – that is a total ballpark figure. When you’re ready to get some absolute numbers, you can try setting –max, and the few frequencies near max to see if you have any unexpected results. Take highest value when done.

The sysfs entries require root for a reason. Any time

# Footnotes

On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I’d assert that is always true. The corollary to that is that at minimum frequency they also finish the workload the slower. Intel GPUs don’t idle for long, they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it’s a number selected after a whole lot of tests run – we could ignore it.

Upstreaming requirements for the DRM subsystem are a bit special since Dave Airlie requires a full-blown open-source implementation as a demonstration vehicle for any new interfaces added. I've figured it's better to clear this up once instead of dealing with the fallout from surprises and made a few slides for a training session. Dave reviewed and acked them, hence this should be the up-to-date rules - the old mails and blogs back from when some ARM SoC vendors tried to push drm drivers for blob userspace to upstream are a bit outdated.

Any here's the slides for my gfx kernel upstreaming requirements training.
 May 07, 2015

So I composed an email today to the Fedora-desktop mailing list to summarize the feedback we got here on the blog post on
my request for reasons people where not switching to Fedora. Thought I should share it here too for easier access for the
wider community and for the commentators to see that I did take the time to go through and try to summarize the posts.
Of course feel free to comment on this blog if you think I missed something important or if there are other major items we
should be thinking about. We will be discussing this both on the mailing list, in our Workstation working group meetings and at Flock in August. Anyway, I will let you go on to read the email (you can find the thread here if interested.

************************
A couple of weeks ago I blogged about who Fedora Workstation is an integrated system, but also asking for
feedback for why people are not migrating to Fedora Workstation, especially asking about why people would be using
GNOME 3 on another distro. So I got about 140 comments on that post so I thought I should write up a summary and
post here. There was of course a lot of things mentioned, but I will try to keep this summary to what I picked up
as the recurring topics.

So while this of course is a poll consisting of self selected commentators I still think the sample is big enough that we
should take the feedback into serious consideration for our plans going forward. Some of them I even think are already
handled by underway efforts.

Quite a few people mentioned this, ranging from those who wanted to switch us to a rolling release, a tick/tock
release style, to just long release cycles. Probably more people saying they thought the current 6 Month cycle
was just to harrowing than people who wanted rolling releases or tick/tock releases.

3rd Party Software
This was the single most brought up item. With people saying that they stayed on other distros due to the pain of
getting 3rd party software on Fedora. This ranged from drivers (NVidia, Wi-Fi), to media codecs to end user
applications. Width of software available in general was also brought up quite a few times. If anyone is in any doubt
that our current policy here is costing us users I think these comments clearly demonstrates otherwise.

Optimus support
Quite a few people did bring up that our Optimus support wasn’t great. Luckily I know Bastien Nocera is working on
something there based on work by Dave Arlie, so hopefully this is one we can check off soon.

Many people also pointed out that we had no UI for upgrading Fedora.

HiDPI issues
A few comments on various challenges people have with HiDPI screens, especially when dealing with non-GTK3 apps-

Multimonitor support
A few comments that our multimonitor support could be better

SELinux is a pain
A few comments about SELinux still getting in the way at times

Better Android integration
A few people asked for more/better Android device integration features

Built in backup solution
A few people requested we create some kind of integrated backup solution

Also a few concrete requests in terms of applications for Fedora:
http://www.mixxx.org
http://www.vocalproject.net
https://gnumdk.github.io/lollypop/
http://peterlevi.com/variety/
http://foldercolor.tuxfamily.org
choqok for GNOME (microblogging client)

This post is a reaction to http://blog.rongarret.info/2015/05/why-lisp.html and I could not think of a better title.

Before I get into my thoughts, I'd like to quickly recap two arguments presented by two different studiers of communication, as they are essential to my premise.

The first argument is known as "the medium is the message," from Marshall McLuhan. It is, in essence, the concept that the medium in which a message is conveyed is part of the message. "Medium" here is the singular of "media."

The key realization from McLuhan is the idea that, since the message is inalienably embedded in its medium, the process of interpreting the medium is intertwined with interpreting the message. A message restated in a different medium is necessarily a different message from the original. It also means that the ability of a consumer to understand the message is connected to their ability to understand the medium.

The second argument has no nifty name that I know of. It is Hofstadter's assertion, from GEB, that messages are decomposable into (at least) three pieces, which he calls the frame, outer message, and inner message. The frame indicates that a message is a message. The outer message explains how the inner message is structured. The inner message is freeform or formless.

What I'm taking from Hofstadter's model is the idea that, although the frame can identify a message, neither the frame nor the outer message can actually explain how the inner message is to be understood. The GEB examples are all linguistic and describe how no part of a message can explain which language was used to craft it without using the language in question. The outer message is the recognition that the inner message exists and is in a certain language.

I can unify these two ideas. I bet that you can, too. McLuhan's medium is Hofstadter's frame and outer message. A message is not just data, but also the metadata which surrounds it.

Can we talk about Lisp? Not quite. I have one more thing. Consider two people speaking English to each other. We have a medium of speech, a frame of rhythmic audio, and an outer message of English speech. They can understand each other, right? Well, not quite. They might not speak exactly the same dialect of English. They have different life experiences, which means that their internal dictionaries, idioms, figures of speech, and so forth might not line up precisely. As an example, our two speakers might disagree on whether "raspberry" refers to fruit or flatulence. Worse, they might disagree on whether "raspberry" refers to berries! These differences constantly occur as people exchange messages.

Our speakers can certainly negotiate, word by word, any contention, as long as they have some basic words in common. However, this might not be sufficient, in the case of some extremely different English dialects like American English and Scottish English. It also might not be necessary; consider the trope of two foreigners without a common language coming together to barter or trade, or the still-not-understood process of linguistic bootstrapping which an infant performs. Even if the inner message isn't decodable, the outer message and frame can still carry some of the content of the message (since the medium is the message!) and communication can still happen in some limited fashion.

I'll now point out that the idea that "normal" communication is not "limited" is illusory; this sort of impedance mismatch occurs during transmission and receipt of every message, since the precise concepts that a person carries with them are ever-changing. A person might not even be able to communicate effectively with themselves; many people have the surreal, uncanny experience of reading a note which they have written to themselves only a day or two ago, and not understanding what the note was trying to convey.

I would like to now postulate an idea. Since the medium is the message and the deciphering and grokking of messages is imperfect, communication is not just about deciphering messages, but also about discovering the meaning of the message via knowledge of the medium and messenger. To put it shortly and informally, communication is about discovering how others talk, not just about talking.

Okay! Now we're ready to tackle Lisp. What does Lisp source look like? Well, it's lots of lists, which contain other lists, and some atoms, and some numbers. Or, at least, that's what a Lisper sees. Most non-Lispers see parentheses. Lots of parentheses. Lispers like to talk about how the parentheses eventually fade away, leaving pure syntax, metasyntax, metapatterns, and so forth. The language has a simple core, and is readily extended by macros, which are also Lisp code.

I'll get to macros in a second. First, I want to deal with something from the original article. "Think about it: to represent hierarchical data you need two syntactic elements: a token separator and a block delimiter. In S expressions, whitespace is the token separator and parens are the block delimiters. That's it. You can't get more minimal than that." I am obligated to disagree. English, the language of the article, has conventions for semantic blocks, but they are often ignored, omitted, etc. I speak Lojban, which at its most formal has no delimiters nor token separators which are readable to the human eye. Placing spaces in Lojban text is a courtesy to human readers but not needed by computers. Another programming language, Forth, has no block delimiters. It also lacks blocks. Also the token separator is, again, a courtesy; Forth's lexer often reinterprets or ignores whitespace when asked.

In fact, Lisp's syntax is, well, syntax. While relatively simple when compared to many languages, Lisp is still structured and retains syntax that would be completely superfluous were one to speak in the language of digital logic which computers normally use. To use the language from earlier in this post, Lisp's syntax is part of its framing; we identify Lisp code precisely by the preponderance of parentheses. The opening parenthesis signals to both humans and computers alike that a Lisp list is starting.

The article talks about the power to interleave data and code. Code operates on data. Code which is also data can be operated on by metacode, which is also code. A strange loop forms. This is not unlike Hofstadter's suggestion that an inner message could itself contain another outer and inner message. English provides us with ample opportunities to form and observe such loops. Consider this quote: "Within these quotation marks, a new message is possible, which can extend beyond its limits and be interpreted differently from the rest of the message: 'Within these quotation marks, a new message is possible.'" I've decided to not put any nasty logic puzzles into this post, but if I had done so, this would be the spot. Mixing metalevels can be hard.

I won't talk about quoting any longer, other than to note that it's not especially interesting if a system supports quoting. There is a proof that quines are possible in any sufficiently complex (Turing complete, for example) language.

Macros are a kind of code-as-data system which involves reflection and reification, transforming code into data and operating on it before turning the data back into code. In particular, this permits code to generate code. This isn't a good or bad thing. In fact, similar systems are seen across the wider programming ecosystem. C has macros, C++ has templates, Haskell has Template Haskell, object-based systems like Python, Ruby, and JS have metaclass or metaprototype faculties.

Lispers loudly proclaim that macros benefit the expressive power of their code. By adding macros to Lisp code, the code becomes more expressive as it takes on deeper metalevels. This is not unlike the expressive power that code gains as it is factored and changed to be more generic. However, it comes at a cost; macros create dialects.

This "dialect" label could be borrowed from the first part of this post, where I used it to talk about spoken languages. Lispers use it to talk about different Lisps which have different macros, different builtin special forms, etc. Dialects create specialization, both for the code and for those reading and writing the code. This specialization is a direct result of the macros that are incorporated into the dialect.

I should stress that I am not saying that macros are bad. This metapower is neither Good nor Bad, in terms of Quality; it is merely different. Lisp is an environment where macros are accepted. Forth is another such environment; the creator of Forth anticipated that most Forth code would not be ANS FORTH but instead be customized heavily for "the task at hand." A Forth machine's dictionary should be full of words which were created by the programmer specifically for the programmer's needs and desires.

Dialects are evident in many other environments. Besides Forth and Lisp, dialects can be seen in large C++ and Java codebases, where tooling and support libraries make up non-trivial portions of applications. Haskellers are often heard complaining about lenses and Template Haskell magic. Pythonistas tell horror stories of Django, web2py, and Twisted. It's not enough to know a language to be effective in these codebases; it's often necessary to know the precise dialect being used. Without the dialect, a programmer has to infer more meaning from the message; they have to put more effort into deciphering and grokking.

"Surely macros and functions are alike in this regard," you might say, if you were my inner voice. And you would be somewhat right, in that macros and functions are both code. The difference is that a macro is metacode; it is code which operates on code-as-data. This necessarily means that usage of a macro makes every reader of the code change how they interpret the code; a reader must either internalize the macro, adding it to their dictionary, or else reinterpret every usage of the macro in terms of concepts that they already know. (I am expecting a sage nod from you, reader, as you reflect upon instances in your past when you first learned a new word or phrase!) Or, to put it in the terms of yore, the medium is the message, and macros are part of the medium of the inner message. The level of understanding of the reader is improved when they know the macros already!

How can we apply this to improve readability of code? For starters, consider writing code in such a way that quotations and macro applications are explicit, and that it is obvious which macros are being applied to quotations. To quote a great programmer, "Namespaces are one honking great idea." Anything that helps isolate and clarify metacode is good for readability.

Since I'm obligated to mention Monte, I'm going to point out that Monte has a very clever and simple way to write metacode: Monte requires metacode to be called with special quotation marks, and also to be annotated with the name of the dialect to use. This applies to regular expressions, parser generators, XML and JSON, etc.; if a new message is to be embedded inside Monte code, and it should be recognized as such, then Monte provides a metacode system for doing precisely that action. In Monte, this system is called the quasiliteral system, and it is very much like quasiliteral quoting within Lisp, with the two differences I just mentioned: Special quotation marks and a dialect annotation.

I think that that's about the end of this particular ramble. Thanks.

 May 04, 2015

A year passed since the first release of The Hacker's Guide to Python in March 2014. A few hundreds copies have been distributed so far, and the feedback is wonderful!

I already wrote extensively about the making of that book last year, and I cannot emphasize enough how this adventure has been amazing so far. That's why I decided a few months ago to update the guide and add some new content.

So let's talk about what's new in this second edition of the book!

First, I obviously fixed a few things. I had some reports about small mistakes and typos which I applied as I received them. Not a lot fortunately, but it's still better to have fewer errors in a book, right?

Then, I updated some of the content. Things changed since I wrote the first chapters of that guide 18 months ago. Therefore I had to rewrite some of the sections and take into account new software or libraries that were released.

At last, I decided to enhance the book with one more interview. I've requested my fellow OpenStack developer Joshua Harlow, who is leading a few interesting Python projects, to join the long list of interviewees in the book. I hope you'll enjoy it!

If you didn't get the book yet, go check it out and use the coupon THGTP2LAUNCH to get 20% off during the next 48 hours!

 April 27, 2015

Just like they did for Debian developers before, it is Valve’s way of saying thanks and giving something back to the community. This is great news for all Mesa contributors, now we can play some great Valve games for free and we can also have an easier time looking into bug reports for them, which also works great for Valve, closing a perfect circle

 April 23, 2015

I just wanted to say that you to everyone who commented and provided Feedback on my last blog entry where I asked for feedback on what would make you switch to Fedora Workstation. We will take all that feedback and incorporate it into our Fedora Workstation 23 planning. So a big thanks to all of you!

 April 22, 2015

I've had a half-broken temperature monitoring setup at home for quite some time. It started out with a Atom-based NAS, a USB-serial adapter and a passive 1-wire adapter. It sometimes worked, then stopped working, then started when poked with a stick. Later, the NAS was moved under the stairs and I put a Beaglebone Black in its old place. The temperature monitoring thereafter never really worked, but I didn't have the time to fix it. Over the last few days, I've managed to get it working again, of course by replacing nearly all the existing components.

I'm using the DS18B20 sensors. They're about USD 1 a piece on Ebay (when buying small quantities) and seems to work quite ok.

My first task was to address the reliability problems: Dropouts and really poor performance. I thought the passive adapter was problematic, in particular with the wire lengths I'm using and I therefore wanted to replace it with something else. The BBB has GPIO support, and various blog posts suggested using that. However, I'm running Debian on my BBB which doesn't have support for DTB overrides, so I needed to patch the kernel DTB. (Apparently, DTB overrides are landing upstream, but obviously not in time for Jessie.)

I've never even looked at Device Tree before, but the structure was reasonably simple and with a sample override from bonebrews it was easy enough to come up with my patch. This uses pin 11 (yes, 11, not 13, read the bonebrews article for explanation on the numbering) on the P8 block. This needs to be compiled into a .dtb. I found the easiest way was just to drop the patched .dts into an unpacked kernel tree and then running make dtbs.

Once this works, you need to compile the w1-gpio kernel module, since Debian hasn't yet enabled that. Run make menuconfig, find it under "Device drivers", "1-wire", "1-wire bus master", build it as a module. I then had to build a full kernel to get the symversions right, then build the modules. I think there is or should be an easier way to do that, but as I cross-built it on a fast AMD64 machine, I didn't investigate too much.

Insmod-ing w1-gpio then works, but for me, it failed to detect any sensors. Reading the data sheet, it looked like a pull-up resistor on the data line was needed. I had enabled the internal pull-up, but apparently that wasn't enough, so I added a 4.7kOhm resistor between pin 3 (VDD_3V3) on P9 and pin (GPIO_45) on P8. With that in place, my sensors showed up in /sys/bus/w1/devices and you can read the values using cat.

In my case, I wanted the data to go into collectd and then to graphite. I first tried using an Exec plugin, but never got it to work properly. Using a [python plugin] worked much better and my graphite installation is now showing me temperatures.

Now I just need to add more probes around the house.

The most useful references were

In addition, various searches for DS18B20 pinout and similar, of course.

 April 21, 2015

Got a talk coming up? Want it to go well? Here’s some starting points.

I give a lot of talks. Often I’m paid to give them, and I regularly get very high ratings or even awards. But every time I listen to people speaking in public for the first time, or maybe the first few times, I think of some very easy ways for them to vastly improve their talks.

Here, I wanted to share my top tips to make your life (and, selfishly, my life watching your talks) much better:

1. Presenter mode is the greatest invention ever. Use it. If you ignore or forget everything else in this post, remember the rainbows and unicorns of presenter mode. This magical invention keeps the current slide showing on the projector while your laptop shows something different — the current slide, a small image of the next slide, and your slide notes. The last bit is the key. What I put on my notes is the main points of the current slide, followed by my transition to the next slide. Presentations look a lot more natural when you say the transition before you move to the next slide rather than after. More than anything else, presenter mode dramatically cut down on my prep time, because suddenly I no longer had to rehearse. I had seamless, invisible crib notes while I was up on stage.
2. Plan your intro. Starting strong goes a long way, as it turns out that making a good first impression actually matters. It’s time very well spent to literally script your first few sentences. It helps you get the flow going and get comfortable, so you can really focus on what you’re saying instead of how nervous you are. Avoid jokes unless most of your friends think you’re funny almost all the time. (Hint: they don’t, and you aren’t.)
3. No bullet points. Ever. (Unless you’re an expert, and you probably aren’t.) We’ve been trained by too many years of boring, sleep-inducing PowerPoint presentations that bullet points equal naptime. Remember presenter mode? Put the bullet points in the slide notes that only you see. If for some reason you think you’re the sole exception to this, at a minimum use visual advances/transitions. (And the only good transition is an instant appear. None of that fading crap.) That makes each point appear on-demand rather than all of them showing up at once.
4. Avoid text-filled slides. When you put a bunch of text in slides, people inevitably read it. And they read it at a different pace than you’re reading it. Because you probably are reading it, which is incredibly boring to listen to. The two different paces mean they can’t really focus on either the words on the slide or the words coming out of your mouth, and your attendees consequently leave having learned less than either of those options alone would’ve left them with.
5. Use lots of really large images. Each slide should be a single concept with very little text, and images are a good way to force yourself to do so. Unless there’s a very good reason, your images should be full-bleed. That means they go past the edges of the slide on all sides. My favorite place to find images is a Flickr advanced search for Creative Commons licenses. Google also has this capability within Search Tools. Sometimes images are code samples, and that’s fine as long as you remember to illustrate only one concept — highlight the important part.
6. Look natural. Get out from behind the podium, so you don’t look like a statue or give the classic podium death-grip (one hand on each side). You’ll want to pick up a wireless slide advancer and make sure you have a wireless lavalier mic, so you can wander around the stage. Remember to work your way back regularly to check on your slide notes, unless you’re fortunate enough to have them on extra monitors around the stage. Talk to a few people in the audience beforehand, if possible, to get yourself comfortable and get a few anecdotes of why people are there and what their background is.
7. Don’t go over time. You can go under, even a lot under, and that’s OK. One of the best talks I ever gave took 22 minutes of a 45-minute slot, and the rest filled up with Q&A. Nobody’s going to mind at all if you use up 30 minutes of that slot, but cutting into their bathroom or coffee break, on the other hand, is incredibly disrespectful to every attendee. This is what watches, and the timer in presenter mode, and clocks, are for. If you don’t have any of those, ask a friend or make a new friend in the front row.
8. You’re the centerpiece. The slides are a prop. If people are always looking at the slides rather than you, chances are you’ve made a mistake. Remember, the focus should be on you, the speaker. If they’re only watching the slides, why didn’t you just post a link to Slideshare or Speakerdeck and call it a day?

I’ve given enough talks that I have a good feel on how long my slides will take, and I’m able to adjust on the fly. But if you aren’t sure of that, it might make sense to rehearse. I generally don’t rehearse, because after all, this is the lazy way.

If you can manage to do those 8 things, you’ve already come a long way. Good luck!

Tagged: communication, gentoo

A few months ago, I wrote a long post about what I called back then the "Gnocchi experiment". Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing it.

It's with a great pleasure that we are going to release our first 1.0 version this month, roughly at the same time that the integrated OpenStack projects release their Kilo milestone. The first release candidate numbered 1.0.0rc1 has been released this morning!

# The problem to solve

Before I dive into Gnocchi details, it's important to have a good view of what problems Gnocchi is trying to solve.

Most of the IT infrastructures out there consists of a set of resources. These resources have properties: some of them are simple attributes whereas others might be measurable quantities (also known as metrics).

And in this context, the cloud infrastructures make no exception. We talk about instances, volumes, networks… which are all different kind of resources. The problems that are arising with the cloud trend is the scalability of storing all this data and being able to request them later, for whatever usage.

What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes.

Gnocchi is fully documented and the documentation is available online. We are the first OpenStack project to require patches to integrate the documentation. We want to raise the bar, so we took a stand on that. That's part of our policy, the same way it's part of the OpenStack policy to require unit tests.

I'm not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I'll guide you through some basics of the features provided by the REST API. I will show you some example so you can have a better understanding of what you could leverage using Gnocchi!

# Handling metrics

Gnocchi provides a full REST API to manipulate time-series that are called metrics. You can easily create a metric using a simple HTTP request:

POST /v1/metric HTTP/1.1Content-Type: application/json {  "archive_policy_name": "low"} HTTP/1.1 201 CreatedLocation: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46aContent-Type: application/json; charset=UTF-8 {  "archive_policy": {    "aggregation_methods": [      "std",      "sum",      "mean",      "count",      "max",      "median",      "min",      "95pct"    ],    "back_window": 0,    "definition": [      {        "granularity": "0:00:01",        "points": 3600,        "timespan": "1:00:00"      },      {        "granularity": "0:30:00",        "points": 48,        "timespan": "1 day, 0:00:00"      }    ],    "name": "low"  },  "created_by_project_id": "e8afeeb3-4ae6-4888-96f8-2fae69d24c01",  "created_by_user_id": "c10829c6-48e2-4d14-ac2b-bfba3b17216a",  "id": "387101dc-e4b1-4602-8f40-e7be9f0ed46a",  "name": null,  "resource_id": null}

The archive_policy_name parameter defines how the measures that are being sent are going to be aggregated. You can also define archive policies using the API and specify what kind of aggregation period and granularity you want. In that case , the low archive policy keeps 1 hour of data aggregated over 1 second and 1 day of data aggregated to 30 minutes. The functions used for aggregations are the mathematical functions standard deviation, minimum, maximum, … and even 95th percentile. All of that is obviously customizable and you can create your own archive policies.

If you don't want to specify the archive policy manually for each metric, you can also create archive policy rule, that will apply a specific archive policy based on the metric name, e.g. metrics matching disk.* will be high resolution metrics so they will use the high archive policy.

It's also worth noting Gnocchi is precise up to the nanosecond and is not tied to the current time. You can manipulate and inject measures that are years old and precise to the nanosecond. You can also inject points with old timestamps (i.e. old compared to the most recent one in the timeseries) with an archive policy allowing it (see back_window parameter).

It's then possible to send measures to this metric:

POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1Content-Type: application/json [  {    "timestamp": "2014-10-06T14:33:57",    "value": 43.1  },  {    "timestamp": "2014-10-06T14:34:12",    "value": 12  },  {    "timestamp": "2014-10-06T14:34:20",    "value": 2  }  ]  HTTP/1.1 204 No Content

These measures are synchronously aggregated and stored into the configured storage backend. Our most scalable storage drivers for now are either based on Swift or Ceph which are both scalable storage objects systems.

It's then possible to retrieve these values:

GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1 HTTP/1.1 200 OKContent-Type: application/json; charset=UTF-8 [  [    "2014-10-06T14:30:00.000000Z",    1800.0,    19.033333333333335  ],  [    "2014-10-06T14:33:57.000000Z",    1.0,    43.1  ],  [    "2014-10-06T14:34:12.000000Z",    1.0,    12.0  ],  [    "2014-10-06T14:34:20.000000Z",    1.0,    2.0  ]]

As older Ceilometer users might notice here, metrics are only storing points and values, nothing fancy such as metadata anymore.

By default, values eagerly aggregated using mean are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the aggregation, start and stop query parameter.

Gnocchi also supports doing aggregation across aggregated metrics:

GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&start=2014-10-06T14:34&aggregation=mean HTTP/1.1 HTTP/1.1 200 OKContent-Type: application/json; charset=UTF-8 [  [    "2014-10-06T14:34:12.000000Z",    1.0,    12.25  ],  [    "2014-10-06T14:34:20.000000Z",    1.0,    11.6  ]]

This computes the mean of mean for the metric 65071775-52a8-4d2e-abb3-1377c2fe5c55 and 9ccdd0d6-f56a-4bba-93dc-154980b6e69a starting on 6th October 2014 at 14:34 UTC.

Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic type of resource, called generic, which has very few attributes. You can extend this type to specialize it, and that's what Gnocchi does by default by providing resource types known for OpenStack such as instance, volume, network or even image.

POST /v1/resource/generic HTTP/1.1 Content-Type: application/json {  "id": "75C44741-CC60-4033-804E-2D3098C7D2E9",  "project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",  "user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"} HTTP/1.1 201 CreatedLocation: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9ETag: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd"Last-Modified: Fri, 17 Apr 2015 11:18:48 GMTContent-Type: application/json; charset=UTF-8 {  "created_by_project_id": "ec181da1-25dd-4a55-aa18-109b19e7df3a",  "created_by_user_id": "4543aa2a-6ebf-4edd-9ee0-f81abe6bb742",  "ended_at": null,  "id": "75c44741-cc60-4033-804e-2d3098c7d2e9",  "metrics": {},  "project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",  "revision_end": null,  "revision_start": "2015-04-17T11:18:48.696288Z",  "started_at": "2015-04-17T11:18:48.696275Z",  "type": "generic",  "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"}

The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that's what the revision_start and revision_end fields are for. They indicates the lifetime of this revision of the resource. The ETag and Last-Modified headers are also unique to this resource revision and can be used in a subsequent request using If-Match or If-Not-Match header, for example:

GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1If-Not-Match: "e3acd0681d73d85bfb8d180a7ecac75fce45a0dd" HTTP/1.1 304 Not Modified

Which is useful to synchronize and update any view of the resources you might have in your application.

You can use the PATCH HTTP method to modify properties of the resource, which will create a new revision of the resource. The history of the resources are available via the REST API obviously.

The metrics properties of the resource allow you to link metrics to a resource. You can link existing metrics or create new ones dynamically:

POST /v1/resource/generic HTTP/1.1Content-Type: application/json {  "id": "AB68DA77-FA82-4E67-ABA9-270C5A98CBCB",  "metrics": {    "temperature": {      "archive_policy_name": "low"    }  },  "project_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D",  "user_id": "BD3A1E52-1C62-44CB-BF04-660BD88CD74D"} HTTP/1.1 201 CreatedLocation: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcbETag: "9f64c8890989565514eb50c5517ff01816d12ff6"Last-Modified: Fri, 17 Apr 2015 14:39:22 GMTContent-Type: application/json; charset=UTF-8 {  "created_by_project_id": "cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad",  "created_by_user_id": "6aadfc0a-da22-4e69-b614-4e1699d9e8eb",  "ended_at": null,  "id": "ab68da77-fa82-4e67-aba9-270c5a98cbcb",  "metrics": {    "temperature": "ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0"  },  "project_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d",  "revision_end": null,  "revision_start": "2015-04-17T14:39:22.181615Z",  "started_at": "2015-04-17T14:39:22.181601Z",  "type": "generic",  "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"}

# Haystack, needle? Find!

With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What's even more interesting is to query the system to find and list the resources you are interested in!

You can search for a resource based on any field, for example:

POST /v1/search/resource/instance HTTP/1.1Content-Type: application/json {  "=": {    "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"  }}

That query will return a list of all resources owned by the user_id bd3a1e52-1c62-44cb-bf04-660bd88cd74d.

You can do fancier queries such as retrieving all the instances started by a user this month:

POST /v1/search/resource/instance HTTP/1.1Content-Type: application/jsonContent-Length: 113 {  "and": [    {      "=": {        "user_id": "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"      }    },    {      ">=": {        "started_at": "2015-04-01"      }    }  ]}

And you can even do fancier queries than the fancier ones (still following?). What if we wanted to retrieve all the instances that were on host foobar the 15th April and who had already 30 minutes of uptime? Let's ask Gnocchi to look in the history!

POST /v1/search/resource/instance?history=true HTTP/1.1Content-Type: application/jsonContent-Length: 113 {  "and": [    {      "=": {        "host": "foobar"      }    },    {      ">=": {        "lifespan": "1 hour"      }    },    {      "<=": {        "revision_start": "2015-04-15"      }    }   ]}

I could also mention the fact that you can search for value in metrics. One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resource whose specific metrics matches some value. For example, having the ability to search for instances whose CPU consumption was over 80% during a month.

# Cherries on the cake

While Gnocchi is well integrated and based on common OpenStack technology, please do note that it is completely able to function without any other OpenStack component and is pretty straight-forward to deploy.

Gnocchi also implements a full RBAC system based on the OpenStack standard oslo.policy and which allows pretty fine grained control of permissions.

There is also some work ongoing to have HTML rendering when browsing the API using a Web browser. While still simple, we'd like to have a minimal Web interface served on top of the API for the same price!

Ceilometer alarm subsystem supports Gnocchi with the Kilo release, meaning you can use it to trigger actions when a metric value crosses some threshold. And OpenStack Heat also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms.

And there are a few more API calls that I didn't talk about here, so don't hesitate to take a peek at the full documentation!

# Towards Gnocchi 1.1!

Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it's one of the first projects that is not part of the (old) integrated release. Therefore we decided to have a release schedule not directly linked to the OpenStack and we'll release more often that the rest of the old OpenStack components – probably once every 2 months or the like.

What's coming next is a close integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we have more requests from our users. We are also exploring different backends such as InfluxDB (storage) or MongoDB (indexer).

Stay tuned, and happy hacking!

 April 20, 2015

So I came across this very positive review of Fedora Workstation on linux.com, although it was billed as a review of GNOME 3.16. Which is of course correct, in the sense that the desktop we are going to ship on Fedora Workstation 22 is GNOME 3.16. But I wanted to take the opportunity to highlight that when you look at Fedora Workstation it is a complete and integrated operating system. As I mentioned in a blog post about Fedora Workstation last April the core idea for Fedora Workstation is to stop treating the operating system as a bag of parts and instead look at it as a whole, because if we are to drain the swamp here we need to solve the major issues that people face with their desktop regardless of if that issue is in the kernel, the graphics drivers, glibc or elsewhere in your system. We can not just look at the small subset of packages that provides the chrome for your user interface in isolation.

This is why we are working on reliable firmware upgrades for your UEFI motherboard by participating in the UEFI working group and adding functionality in GNOME Software to handle doing firmware updates.

This is why we recently joined Khronos to make sure the standards for doing 3D on Linux are good and open source friendly.

This is why we been working so hard on improving coverage of Appdata metadata coverage, well beyond the confines of ‘GNOME’ software.

This is why we have Richard Hughes and Owen Taylor working on how we can improve battery life when running
Fedora or RHEL on laptops.

This is why we created dnf to replace yum, to get a fast and efficient package update system.

This is why we are working on an Adwaita theme for Qt

And this is why we are pushing hard forward with a lot of other efforts like Wayland, libinput, Fleet Commander, Boxes and more.

So when you look at the user experience you get on Fedora Workstation, remember that it is not just a question of which version of GNOME we are shipping, but it is the fact that we are working hard on putting together a tightly vertically integrated and tested system from the kernel up to core desktop applications.

Anyone who has been using Fedora for a long while knows that this change was a major change in philosophy and approach for the project, as Fedora up to the 21 release of the 3 new products was very much defined by the opposite, being all about the lego blocks, which contributed to the image of Fedora being a bleeding edge system where you should be prepared to do a lot of bleeding and where you probably wanted to keep your toolbox with you at all times in case something broke. So I have to say that I am mightily impressed by how the Fedora community has taken to this major change where we now are instead focusing our efforts on our 3 core products and are putting a lot of effort into creating stuff that is polished and reliable, and which aims to be leading edge instead of bleeding edge.

So with all this in mind I was a little disappointed when the reviewer writing the article in question ended his review by saying he was now waiting for GNOME 3.16 to appear in Ubuntu GNOME, because there is no guarantees that he would get the same overall user experience in Ubuntu GNOME that we have developed for Fedora Workstation, which is the user experience his review reflects.

Anyway, I thought this could be a good opportunity to actually ask the wider community a question, especially if you are using GNOME on another distribution than Fedora, what are we still missing at this point for you to consider making a switch to Fedora Workstation? I know that for some of you the answer might be as simple as ‘worn in shoes fits the best’, but anything you might have beyond that would be great to hear.
I can’t promise that we will be able to implement every suggestion you add to this blog post, but I do promise that we will review and consider every suggestion you provide and try to see how it can fit into development plans going forward.

With Linux kernel v3.20^W v4.0 already out there door my overview of what's in 4.1 for drm/i915 is way overdue.

First looking at the modeset side of the driver the big overall thing is all the work to convert i915 to atomic. In this release there's code from Ander Conselvan de Oliveira to have a struct drm_atomic_state allocated for all the legacy modeset code paths in the driver. With that we can switch the internals to start using atomic state objects and gradually convert over everything to atomic on the modeset side. Matt Roper on the other hand was busy to prepare the plane code to land the atomic watermark update code. Damien has reworked the initial plane configuration code used for fastboot, which also needs to be adapted to the atomic world.

For more specific feature work there's the DRRS (dynamic refresh rate switching) from Sonika, Vandana and more people, which is now enabled where supported. The idea is to reduce the refresh rate of the panel to save power when nothing changes on the screen. And Paulo Zanoni has provided patches to improve the FBC code, hopefully we can enable that by default soon too. Under the hood Ville has refactored the DP link rate computation and the sprite color key handling, both to prepare for future work and platform enabling. Intermediate link rate support for eDP 1.4 from Sonika built on top of this. Imre Deak has also reworked the Baytrail/Braswell DPLL code to prepare for Broxton.

Speaking of platforms, Skyleigh has gained runtime PM support from Damien, and RPS (render turbo and sleep states) from Akash. Another SKL exclusive is support for scanout of Y-tiled buffers and scanning out buffers rotated by 90°/270° (instead of just normal and rotated by 180°) from Tvrtko and Damien. Well the rotation support didn't quite land yet, but Tvrtko's support for the special pagetable binding needed for that feature in the form of rotated GGTT views. Finally Nick Hoath and Damien also submitted a lot of workaround patches for SKL.

Moving on to Braswell/Cherryview there have been tons of fixes to the DPLL and watermark code from Vijay and Ville, and BSW left the preliminary hw support. And also for the SoC platforms Chris Wilson has supplied a pile of patches to tune the rps code and bring it more in-line with the big core platforms.

On the GT side the big ongoing work is dyanmic pagetable allocations Michel Thierry based upon patches from Ben Widawsky. With per-process address spaces and even more so with the big address spaces gen8+ supports it would be wasteful if not impossible to allocate pagetables for the entire address space upfront. But changing the code to handle another possible memory allocation failure point needed a lot of work. Most of that has landed now, but the benefits of enabling bigger address spaces haven't made it into 4.1.

Another big work is XenGT client-side support fromYu Zhang and team. This is paravirtualization to allow virtual machines to tap into the render engines without requiring exclusive access, but also with a lot less overhead than full virtual hardware like vmware or virgil would provide. The host-side code is also submitted already, but needs a bit more work still to integrate cleanly into the driver.

And of course there's been lots of other smaller work all over, as usual. Internal documentation for the shrinker, more dead UMS code removed, the vblank interrupt code cleaned up and more.
 April 18, 2015

So the Endless Computer kickstarter just succeeded in their funding goal of 100K USD. A big heartfelt congratulations to the whole team and I am looking forward to receiving my system. For everyone else out there I strongly recommend getting in on their kickstarter, not only do you get a cool looking computer with a really nice Linux desktop, you are helping a company forward that has the potential to take the linux dektop to the next level. And the be aware that the computer is a standard computer (yet very cool looking) at the end of the day, so if you want to install Fedora Workstation on it you can

 April 17, 2015

So Red Hat are now formally a member of the Khronos Groups who many of probably know as the shepherds of the OpenGL standard. We haven’t gotten all the little bits sorted yet, like getting our logo on the Khronos website, but our engineers are signing up for the various Khronos working groups etc. as we speak.

So the reason we are joining is because of all the important changes that are happening in Graphics and GPU compute these days and our wish to have more direct input of the direction of some of these technologies. Our efforts are likely to focus on improving the OpenGL specification by proposing some new extensions to OpenGL, and of course providing input and help with moving the new Vulkan standard forward.

So well known Red Hat engineers such as Dave Airlie, Adam Jackson, Rob Clark and others will from now on play a much more direct role in helping shape the future of 3D Graphics standards. And we really look forward to working with our friends and colleagues at Nvidia, AMD, Intel, Valve and more inside Khronos.

 April 16, 2015

I am following the discussion caused by Greg Kroah-Hartman requesting that kdbus be pulled into the next kernel release. First of all my hat of to Greg for his persistence and staying civil. There has already been quite a few posts to the thread at coming close to attempts at character assassination and a lot of emails just adding more noise, but no signal.

One point I feel is missing from the discussion here though is the question of not making the perfect the enemy of the good. A lot of the posters are saying that ‘hey, you should write something perfect here instead of what you have currently’. Which sounds reasonable on some level, but when you think of it is more a deflection strategy than a real suggestion. First of all the is no such thing as perfect. And secondly if there was how long would it take to provide the theoretical perfect thing? 2 more years, 4 years, 10 years?

Speaking as someone involved in making an operating system I can say that we would have liked to have kdbus 5 years ago and would much prefer to get kdbus in the next few Months, than getting something ‘perfect’ in 5 years from now.

So you can say ‘hey, you are creating a strawman here, nobody used the word ‘perfect” in the discussion. Sure, nobody used that word, but a lot of messages was written about how ‘something better’ should be created here instead. Yet, based on from where these people seemed to be coming the question I ask then is: Better for who? Better for the developers who are already using dbus in the applications or desktops? Better for a kernel developer who is never going to use it? Better for someone doing code review? Better for someone who doesn’t need the features of dbus, but who would want something else?

And what is ‘better’ anyway? Greg keeps calling for concrete technical feedback, but at the end of the day there is a lot of cases where the ‘best’ technical solution, to the degree you can measure that objectively, isn’t actually ‘the best’. I mean if I came up with a new format for storing multimedia on an optical disk, one which from a technical perspective is ‘better’ than the current Blu-Ray spec, that doesn’t mean it is actually better for the general public. Getting a non-standard optical disc that will not play in your home Blu-Ray player isn’t better for 99.999% of people, regardless of the technical merit of the on-disc data format.

Something can actually be ‘better’ just because it is based on something that already exists, something which have a lot of people already using it, lets people quickly extend what they already are doing with the new functionality without needing a re-write and something which is available ‘now’ as opposed to some undefined time in the future. And that is where I feel a lot of the debaters on the lkml are dropping the ball in this discussion; they just keeping asking for a ‘better solution’ to the challenges of a space they often don’t have any personal experience with developing in, because kdbus doesn’t conform to how they would implement a kernel IPC mechanism for the kind of problems they are used to trying to solve.

Also there has been a lot of arguments about the ‘design’ of kdbus and dbus. A lot of lacking concreteness to it and mostly coming from people very far removed from working on the desktop and other relevant parts of userspace. Which at the end of the day boils down to trying to make the lithmus test ‘you have to prove to me that making a better design is impossible’, and I think that anyone should be able to agree that if that was the test for adding anything to the Linux kernel or elsewhere then very little software would ever get added anywhere. In fact if we where to hold to that kind of argumentation we might as well leave software development behind and move to an online religion discussion forum tossing the good ol’ “prove to me that God doesn’t exist’ into the ring.

So in the end I hope Linus, who is the final arbiter here, doesn’t end up making the ‘perfect’ the enemy of the good.

 April 14, 2015

So George R.R. Martin has spent quite a bit of time over the last week, and even a more energy, on responding to this thing called Puppygate. Which not directly related got some commonalities with last years gamergate story.

Yet, what strikes me skimming through a plethora of online discussions is that we seemed to have perfected name calling as an argumentation technique on the Internet. And since the terminology jungle can be a bit confusing I thought I should break it down by putting together this list of common names, so that you know which one you are Trying for a tongue in cheek summary here, but if I fail at least you know what names to call me!

• If you don’t think the world economy is working perfectly you are a Communist
• If you are not in favor of free immigration you are a Racist
• If you think women have achieved any level of equality today you are a Misogynist
• If you are critical to any part of the Christian faith you are a Christian hater or taking part in a War on Christmas
• If you think anything is well in the world today you are a Straight White Male
• If you are against any Israeli policies your an Antisemite or Neo-nazi
• If you feel that gays, women or ethnic minorities are not having their voice heard on an equal level you are a Social Justice Warrior
• If you are critical of any part of the Quran or Islamic practice you are an Islamophobe

I am sure there are a lots more, these are just a collection of some I seen recently. So while I think there are real issues behind many of these examples I think sadly the Internet seems best for turning anything into a binary question, which surprisingly turns out to not be a super method for finding common ground, and as it turns out if you can associate a name to these things you can turn them into binary questions even faster!

Not that these things are specifically tied to the Internet, long time since I last saw any kind of political debate that felt like there was any time or interest in bringing out the nuances into the discussion; We do live in the age of bumper sticker politics after all.

 April 07, 2015

On Intel's Gen graphics, three source instructions like MAD and LRP cannot have constants as arguments. When support for MAD instructions was introduced with Sandybridge, we assumed the choice between a MOV+MAD and a MUL+ADD sequence was inconsequential, so we chose to perform the multiply and add operations separately. Revisiting that assumption has uncovered some interesting things about the hardware and has lead us to some pretty nice performance improvements.

On Gen 7 hardware (Ivybridge, Haswell, Baytrail), multiplies and adds without immediate value arguments can be co-issued, meaning that multiple instructions can be issued from the same execution unit in the same cycle. MADs, never having immediates as sources, can always be co-issued. Considering that, we should prefer MADs, but a typical vec4 * vec4 + vec4(constant) pattern would lead to three duplicate (four total) MOV imm instructions.

mov(8)  g10<1>F    1.0F
mov(8)  g11<1>F    1.0F
mov(8)  g12<1>F    1.0F
mov(8)  g13<1>F    1.0F
mad(8)  g43<1>F    g13<8,8,1>F   g23<8,8,1>F   g33<8,8,1>F

Should be easy to clean up, right? We should simply combine those 1.0F MOVs and modify the MAD instructions to access the same register. Well, conceptually yes, but in practice not quite.

Since the i965 driver's fragment shader backend doesn't use static single assignment form (it's on our TODO list), our common subexpression elimination pass has to emit a MOV instruction when combining instructions. As a result, performing common subexpression elimination on immediate MOVs would undo constant propagation and the compiler's optimizer would go into an infinite loop. Not what you wanted.

Instead, I wrote a pass that scans the instruction list after the main optimization loop and creates a list of immediate values that are used. If an immediate value is used by a 3-source instruction (a MAD or a LRP) or at least four times by an instruction that can co-issue (ADD, MUL, CMP, MOV) then it's put into a register and sourced from there.

But there's still room for improvement. Each general register can store 8 floats, and instead of storing 8 separate constants in each, we're storing a single constant 8 times (and on SIMD16, 16 times!). Fixing that wasn't hard, and it significantly reduces register usage - we now only use one register for each 8 immediate values. Using a special vector-float immediate type we can even load four floating-point values in a single instruction.

With that in place, we can now always emit MAD instructions.

I'm pretty pleased with the results. Without using the New Intermediate Representation (NIR), the shader-db results are:

total instructions in shared programs: 5895414 -> 5747578 (-2.51%)
instructions in affected programs:     3618111 -> 3470275 (-4.09%)

total instructions in shared programs: 7992936 -> 7772474 (-2.76%)
instructions in affected programs:     3738730 -> 3518268 (-5.90%)

## Effects on a WebGL microbenchmark

In December, I checked what effect my constant combining pass would have on a WebGL procedural noise demo. The demo generates an effect ("noise") that looks like a ball of fire. Its fragment shader contains a ton of instructions but no texturing operations. We're currently able to compile the program in SIMD8 without spilling any registers, but at a cost of scheduling the instructions very badly.

The effects the constant combining pass has on this demo are really interesting, and it actually gives me evidence that some of the ideas I had for the pass are valid, namely that co-issuing instructions is worth a little extra register pressure.

1. 1.00x FPS of baseline - 3123 instructions - baseline
2. 1.09x FPS of baseline - 2841 instructions - after promoting constants only if used by more than 2 MADs

Going from no-constant-combining to restricted-constant-combining gives us a 9% increase in frames per second for a 9% instruction count reduction. We're totally limited by fragment shader performance.

1. 1.46x FPS of baseline - 2841 instructions - after promote any constant used by a MAD

Going from step 2 to 3 though is interesting. The instruction count doesn't change, but we reduced register pressure sufficiently that we can now schedule instructions better without spilling (SCHEDULE_PRE, instead of SCHEDULE_PRE_NON_LIFO) - a 33% speed up just by rearranging instructions.

1. 1.62x FPS of baseline - 2852 instructions - after promoting constants used by at least 4 co-issueable instructions

I was worried that we weren't going to be able to measure any performance difference from pulling constants out of co-issueable instructions, but we can definitely get a nice improvement here, of about 10% increase in frames per second.

As an aside, I did an experiment to see what would happen if we used SCHEDULE_PRE and spilled registers anyway (I added a couple of extra instructions to increase register pressure over the threshold). I changed the window size to 2048x2048 and rendered a fixed number of frames.

• SCHEDULE_PRE with no spills: 17.5 seconds
• SCHEDULE_PRE with 4 spills (8 send instructions): 17.5 seconds
• SCHEDULE_PRE_NON_LIFO with no spills: 28 seconds

So there's some good evidence that the cure is worse than the disease. Of course this demo doesn't do any texturing, so memory bandwidth is not at a premium.

1. 1.76x FPS of baseline - 2609 instructions - ???

I ran the demo to see if we'd made any changes in the last two months and was pleasantly surprised to find that we'd cut another 9% of instructions. I have no idea what caused it, but I'll take it! Combined with everything else, we're up to a 76% performance improvement.

## Where's the code

The Mesa patches that implement the constant combining pass were committed (commit bb33a31c) and will be in the next major release (presumably version 10.6).

If any of this sounds interesting enough that you'd like to do it for a living, feel free to contact me. My team at Intel is responsible for the open source 3D driver in Mesa and is looking for new talent.

 April 02, 2015
Presentation and conferencing

Last week-end, in the Salle des Rancy in Lyon, GNOME folks (Fred Peters, Mathieu Bridon and myself) set up our booth at the top of the stairs, the space graciously offered by Ubuntu-FR and Fedora being a tad bit small. The JdLL were starting.

We gave away a few GNOME 3.14 Live and install DVDs (more on that later), discussed much-loved features, and hated bugs, and how to report them. A very pleasant experience all-in-all.

On Sunday afternoon, I did a small presentation about GNOME's 15 years. Talking about the upheaval, dragging kernel drivers and OS components kicking and screaming to work as their APIs say they should, presenting GNOME 3.16 new features and teasing about upcoming GNOME 3.18 ones.

During the Q&A, we had a few folks more than interested in support for tablets and convertible devices (such as the Microsoft Surface, and Asus T100). Hopefully, we'll be able to make the OS support good enough for people to be able to use any Linux distribution on those.

Sideshow with the Events box

Due to scheduling errors on my part, we ended up with the "v1" events box for our booth. I made a few changes to the box before we used it:

• Removed the 17" screen, and replaced it with a 21" widescreen one with speakers builtin. This is useful when we can't setup the projector because of the lack of walls.
• Upgraded machine to 1GB of RAM, thanks to my hoarding of old parts.
• Bought a French keyboard and removed the German one (with missing keys), cleaned up the UK one (which still uses IR wireless).
• Threw away GNOME 3.0 CDs (but kept the sleeves that don't mention the minor version). You'll need to take a sharpie to the small print on the back of the sleeve if you don't fill it with an OpenSUSE CD (we used Fedora 21 DVDs during this event).
• Triaged the batteries. Office managers, get this cheap tester!
• The machine's Wi-Fi was unstable, causing hardlocks (please test again if you use a newer version of the kernel/distributions). We tried to get onto the conference network through the wireless router, and installed DD-WRT on it as the vendor firmware didn't allow that.
• The Nokia N810 and N800 tablets will going to kernel developers that are working on Nokia's old Linux devices and upstreaming drivers.
The events box is still in Lyon, until I receive some replacement hardware.

The machine is 7 years-old (nearly 8!) and only had 512MB of RAM, after the 1GB upgrade, the machine was usable, and many people were impressed by the speed of GNOME on a legacy machine like that (probably more so than a brand new one stuttering because of a driver bug, for example).

This makes you wonder what the use for "lightweight" desktop environments is, when a lot of the features are either punted to helpers that GNOME doesn't need or not implemented at all (old CPU and no 3D driver is pretty much the only use case for those).

I'll be putting it in a small SSD into the demo machine, to give it another speed boost. We'll also be needing a new padlock, after an emergency metal saw attack was necessary on Sunday morning. Five different folks tried to open the lock with the code read off my email, to no avail. Did we accidentally change the combination? We'll never know.

New project, ish

For demo machines, especially newly installed ones, you'll need some content to demo applications. This is my first attempt at uniting GNOME's demo content for release notes screenshots, with some additional content that's free to re-distribute. The repository will eventually move to gnome.org, obviously.

Thanks

The new keyboard and mouse, monitor, padlock, and SSD (and my time) were graciously sponsored by Red Hat.
[This is a cross-post from the mail I just sent out to intel-gfx.]

Code of conducts seem to be in the news a bit recently, and I realized that I've never really documented how we run things. It's different from the kernel's overall CodeOfConflict and also differs from the official Intel/OTC one in small details about handling issues. And for completeness there's also the Xorg Foundation event policy. Anyway, I think this is worth clarifying and here it goes.

It's simple: Be respectful, open and excellent to each another.

Which doesn't mean we want to sacrifice quality to be nice. Striving for technical excellence very much doesn't exclude being excellent to someone else, and in our experience it tends to go hand in hand.

Unfortunately things go south occasionally. So if you feel threatened, personally abused or otherwise uncomfortable, even and especially when you didn't participate in a discussion yourself, then please raise this in private with the drm/i915 maintainers (currently Daniel Vetter and Jani Nikula, see MAINTAINERS for contact information). And the "in private" part is important: Humans screw up, disciplining minor fumbles by tarnishing someones google-able track record forever is out of proportion.

Still there are some teeth to this code of conduct:

1. First time around minor issues will be raised in private.

2. On repeat cases a public reply in the discussion will enforce that respectful behavior is expected.

3. We'll ban people who don't get it.

And severe cases will be escalated much quicker.

This applies to all community communication channels (irc, mailing list and bugzilla). And as mentioned this really just is a public clarification of the rules already in place - you can't see that though since we never had to go further than step 1.

Let's keep it at that.

And in case you have a problem with an individual drm/i915 maintainer and don't want to raise it with the other one there's the Xorg BoD, linux foundation TAB and the drm upstream maintainer Dave Airlie.
 March 30, 2015

And once again, it’s time for another Limba blogpost

Limba is a solution to install 3rd-party software on Linux, without interfering with the distribution’s native package manager. It can be useful to try out different software versions, use newer software on a stable OS release or simply to obtain software which does not yet exist for your distribution.

Limba works distribution-independent, so software authors only need to publish their software once for all Linux distributions.

I recently released version 0.4, with which all most important features you would expect from a software manager are complete. This includes installing & removing packages, GPG-signing of packages, package repositories, package updates etc. Using Limba is still a bit rough, but most things work pretty well already.

So, it’s time for another progress report. Since a FAQ-like list is easier to digest. compared to a long blogpost, I go with this again. So, let’s address one important general question first:

### How does Limba relate to the GNOME Sandboxing approach?

(If you don’t know about GNOMEs sandboxes, take a look at the GNOME Wiki – Alexander Larsson also blogged about it recently)

First of all: There is no rivalry here and no NIH syndrome involved. Limba and GNOMEs Sandboxes (XdgApp) are different concepts, which both have their place.

The main difference between both projects is the handling of runtimes. A runtime is the shared libraries and other shared ressources applications use. This includes libraries like GTK+/Qt5/SDL/libpulse etc. XdgApp applications have one big runtime they can use, built with OSTree. This runtime is static and will not change, it will only receive critical security updates. A runtime in XdgApp is provided by a vendor like GNOME as a compilation of multiple single libraries.

Limba, on the other hand, generates runtimes on the target system on-the-fly out of several subcomponents with dependency-relations between them. Each component can be updated independently, as long as the dependencies are satisfied. The individual components are intended to be provided by the respective upstream projects.

Both projects have their individual up and downsides: While the static runtime of XdgApp projects makes testing simple, it is also harder to extend and more difficult to update. If something you need is not provided by the mega-runtime, you will have to provide it by yourself (e.g. we will have some applications ship smaller shared libraries with their binaries, as they are not part of the big runtime).

Limba does not have this issue, but instead, with its dynamic runtimes, relies on upstreams behaving nice and not breaking ABIs in security updates, so existing applications continue to be working even with newer software components.

Obviously, I like the Limba approach more, since it is incredibly flexible, and even allows to mimic the behaviour of GNOMEs XdgApp by using absolute dependencies on components.

### Do you have an example of a Limba-distributed application?

Yes! I recently created a set of package for Neverball – Alexander Larsson also created a XdgApp bundle for it, and due to the low amount of stuff Neverball depends on, it was a perfect test subject.

One of the main things I want to achieve with Limba is to integrate it well with continuous integration systems, so you can automatically get a Limba package built for your application and have it tested with the current set of dependencies. Also, building packages should be very easy, and as failsafe as possible.

You can find the current Neverball test in the Limba-Neverball repository on Github. All you need (after installing Limba and the build dependencies of all components) is to run the make_all.sh script.

Later, I also want to provide helper tools to automatically build the software in a chroot environment, and to allow building against the exact version depended on in the Limba package.

Creating a Limba package is trivial, it boils down to creating a simple “control” file describing the dependencies of the package, and to write an AppStream metadata file. If you feel adventurous, you can also add automatic build instructions as a YAML file (which uses a subset of the Travis build config schema)

This is the Neverball Limba package, built on Tanglu 3, run on Fedora 21:

### Which kernel do I need to run Limba?

The Limba build tools run on any Linux version, but to run applications installed with Limba, you need at least Linux 3.18 (for Limba 0.4.2). I plan to bump the minimum version requirement to Linux 4.0+ very soon, since this release contains some improvements in OverlayFS and a few other kernel features I am thinking about making use of.

Linux 3.18 is included in most Linux distributions released in 2015 (and of course any rolling release distribution and Fedora have it).

### Building all these little Limba packages and keeping them up-to-date is annoying…

Yes indeed. I expect that we will see some “bigger” Limba packages bundling a few dependencies, but in general this is a pretty annoying property of Limba currently, since there are so few packages available you can reuse. But I plan to address this. Behind the scenes, I am working on a webservice, which will allow developers to upload Limba packages.

This central ressource can then be used by other developers to obtain dependencies. We can also perform some QA on the received packages, map the available software with CVE databases to see if a component is vulnerable and publish that information, etc.

All of this is currently planned, and I can’t say a lot more yet. Stay tuned! (As always: If you want to help, please contact me)

### Are the Limba interfaces stable? Can I use it already?

The Limba package format should be stable by now – since Limba is still Alpha software, I will however, make breaking changes in case there is a huge flaw which makes it reasonable to break the IPK package format. I don’t think that this will happen though, as the Limba packages are designed to be easily backward- and forward compatible.

For the Limba repository format, I might make some more changes though (less invasive, but you might need to rebuilt the repository).

tl;dr: Yes! Plase use Limba and report bugs, but keep in mind that Limba is still in an early stage of development, and we need bug reports!

### Will there be integration into GNOME-Software and Muon?

From the GNOME-Software side, there were positive signals about that, but some technical obstancles need to be resolved first. I did not yet get in contact with the Muon crew – they are just implementing AppStream, which is a prerequisite for having any support for Limba[1].

Since PackageKit dropped the support for plugins, every software manager needs to implement support for Limba.

So, thanks for reading this (again too long) blogpost There are some more exciting things coming soon, especially regarding AppStream on Debian/Ubuntu!

[1]: And I should actually help with the AppStream support, but currently I can not allocate enough time to take that additional project as well – this might change in a few weeks. Also, Muon does pretty well already!

 March 25, 2015
Did you see?

It will obviously be in Fedora 22 Beta very shortly.

What happened since 3.14? Quite a bit, and a number of unfinished projects will hopefully come to fruition in the coming months.

Hardware support

After quite a bit of back and forth, automatic rotation for tablets will not be included directly in systemd/udev, but instead in a separate D-Bus daemon. The daemon has support for other sensor types, Ambient Light Sensors (ColorHug ALS amongst others) being the first ones. I hope we have compass support soon too.

Support for the Onda v975w's touchscreen and accelerometer are now upstream. Work is on-going for the Wi-Fi driver.

I've started some work on supporting the much hated Adaptive keyboard on the X1 Carbon 2nd generation.

Technical debt

In the last cycle, I've worked on triaging gnome-screensaver, gnome-shell and gdk-pixbuf bugs.

The first got merged into the second, the second got plenty of outdated bugs closed, and priorities re-evaluated as a result.

I wrangled old patches and cleaned up gdk-pixbuf. We still have architectural problems in the library for huge images, but at least we're up to a state where we know what the problems are, not being buried in Bugzilla.

Foundation building

A couple of projects got started that didn't reached maturation yet. I'm pretty happy that we're able to use gnome-books (part of gnome-documents) today to read Comic books. ePub support is coming!

Grilo saw plenty of activity. The oft requested "properties" page in Totem is closer than ever, so is series grouping.

In December, Allan and I met with the ABRT team, and we've landed some changes we discussed there, including a simple "Report bugs" toggle in the Privacy settings, with a link to the OS' privacy policy. The gnome-abrt application had a facelift, but we got somewhat stuck on technical problems, which should get solved in the next cycle. The notifications were also streamlined and simplified.

I'm a fan

Of the new overlay scrollbars, and the new gnome-shell notification handling. And I'm cheering on co-new app in 3.16, GNOME Calendar.

There's plenty more new and interesting stuff in the release, but I would just be duplicating much of the GNOME 3.16 release notes.
 March 20, 2015

Le week-end prochain, je vais faire une petite présentation sur les quinze ans de GNOME aux JdLL.

Si les dieux de la livraison sont cléments, GNOME devrait aussi avoir une présence dans le village associatif.
Now that Presentation feedback has finally landed in Weston (feedback, flags), people are starting to pay attention to the output timings as now you can better measure them. I have seen a couple of complaints already that Weston has an extra frame of latency, and this is true. I also have a patch series to fix it that I am going to propose.

To explain how the patch series affects Weston's repaint loop, I made some JSON-timeline recordings before and after, and produced some graphs with Wesgr. Here I will explain how the repaint loop works timing-wise.

Original post Feb 11, 2015.
Update Mar 20, 2015: the patches have landed in Weston.

### The old algorithm

The old repaint scheduling algorithm in Weston repaints immediately on receiving the pageflip completion event. This maximizes the time available for the compositor itself to repaint, but it also means that clients can never hit the very next vblank / pageflip.

 Figure 1. The old algorithm, the client paints as response to frame callbacks.

Frame callback events are sent at the "post repaint" step. This gives clients almost a full frame's time to draw and send their content before the compositor goes to "begin repaint" again. In Figure 1. you see, that if a client paints extremely fast, the latency to screen is almost two frame periods. The frame latency can never be less than one frame period, because the compositor samples the surface contents (the "repaint flush" point) immediately after the previous vblank.

 Figure 2. The old algorithm, the client paints as response to Presentation feedback events.

While frame callback driven clients still get to the full frame rate, the situation is worse if the client painting is driven by presentation_feedback.presented events. The intent is to draw and show a new frame as soon as the old frame was shown. Because Weston starts repaint immediately on the pageflip completion, which is essentially the same time when Presentation feedback is sent, the client cannot hit the repaint of this frame and gets postponed to the next. This is the same two frame latency as with frame callbacks, but here the framerate is halved because the client waits for the frame to be actually shown before continuing, as is evident in Figure 2.

 Figure 3. The old algorithm, client posts a frame while the compositor is idle.

Figure 3. shows a less relevant case, where the compositor is idle while a client posts a new frame ("damage commit"). When the compositor is idle graphics-wise (the gray background in the figure), it is not repainting continuously according to the output scanout cycle. To start painting again, Weston waits for an extra vblank first, then repaints, and then the new frame is shown on the next vblank. This is also a 1-2 frame period latency, but it is unrelated to the other two cases, and is not changed by the patches.

### The modification to the algorithm

The modification is simple, yet perhaps counter-intuitive at first. We reduce the latency by adding a delay. The "delay before repaint" is in all the figures, and the old algorithm is essentially using a zero delay. The compositor's repaint is delayed so that clients have a chance to post a new frame before the compositor samples the surface contents.

A good amount of delay is a hard question. Too small delay and clients do not have time to act. Too long delay and the compositor itself will be in danger of missing the vblank deadline. I do not know what a good amount is or how to derive it, so I just made it configurable. You can set the repaint window length in milliseconds in weston.ini. The repaint window is the time from starting repaint to the deadline, so the delay is the frame period minus the repaint window. If the repaint window is too long for a frame period, the algorithm will reduce to the old behaviour.

### The new algorithm

The following figures are made with a 60 Hz refresh and a 7 millisecond repaint window.

 Figure 4. The new algorithm, the client paints as response to frame callback.

When a client paints as response to the frame callback (Figure 4), it still has a whole frame period of time to paint and post the frame. The total latency to screen is a little shorter now, by the length of the delay before compositor's repaint. It is a slight improvement.

 Figure 5. The new algorithm, the client paints as response to Presentation feedback.

A significant improvement can be seen in Figure 5. A client that uses the Presentation extension to wait for a frame to be actually shown before painting again is now able to reach the full output frame rate. It just needs to paint and post a new frame during the delay before compositor's repaint. This mode of operation provides the shortest possible latency to screen as the client is able to target the very next vblank. The latency is below one frame period if the deadlines are met.

### Discussion

This is a relatively simple change that should reduce display latency, but analyzing how exactly it affects things is not trivial. That is why Wesgr was born.

This change does not really allow clients to wait some additional time before painting to reduce the latency even more, because nothing tells clients when the compositor will repaint exactly. The risk of missing an unknown deadline grows the later a client paints. Would knowing the deadline have practical applications? I'm not sure.

These figures also show the difference between the frame callback and Presentation feedback. When a client's repaint loop is driven by frame callbacks, it maximizes the time available for repainting, which reduces the possibility to miss the deadline. If a client drives its repaint loop by Presentation feedback events, it minimizes the display latency at the cost of increased risk of missing the deadline.

All the above ignores a few things. First, we assume that the time of display is the point of vblank which starts to scan out the new frame. Scanning out a frame actually takes most of the frame period, it's not instantaneous. Going deeper, updating the framebuffer during scanout period instead of vblank could allow reducing latency even more, but the matter becomes complicated and even somewhat subjective. I hear some people prefer tearing to reduce the latency further. Second, we also ignore any fencing issues that might come up in practise. If a client submits a GPU job that takes a long while, there is a good chance it will cause everything to miss a deadline or more.

As usual, this work and most of the development of JSON-timeline and Wesgr were sponsored by Collabora.

PS. Latency and timing issues are nothing new. Owen Taylor has several excellent posts on related subjects in his blog.
 March 19, 2015

[Update 19/03/15]: TLDR: kernel patches queued for 4.0 do not require any userspace changes.

Lenovo released a new set of laptops for 2015 with a new (old) feature: the trackpoint device has the physical buttons back. Last year's experiment apparently didn't work out so well.

What do we do in Linux with the last generation's touchpads? The kernel marks them with INPUT_PROP_TOPBUTTONPAD based on the PNPID [1]. In the X.Org synaptics driver and libinput we take that property and emulate top software buttons based on it. That took us a while to get sorted and feed into the myriad of Linux distributions out there but at some point last year the delicate balance of nature was restored and the touchpad-related rage dropped to the usual background noise.

Slow-forward to 2015 and Lenovo introduces the new series. In the absence of unnecessary creativity they are called the X1 Carbon 3rd, T450, T550, X250, W550, L450, etc. Lenovo did away with the un(der)-appreciated top software buttons and re-introduced physical buttons for the trackpoint. Now, that's bound to make everyone happy again. However, as we learned from Agent Smith, happiness is not the default state of humans so Lenovo made sure the harvest is safe.

What we expected to happen was that the trackpoint device has BTN_LEFT, BTN_MIDDLE, BTN_RIGHT and the touchpad has BTN_LEFT and is marked with INPUT_PROP_BUTTONPAD (i.e. it is a Clickpad). That is the case on the x220 generation and the T440 generation. Though the latter doesn't actually have trackpoint buttons and we emulated them in software.

On the X1 Carbon 3rd, the trackpoint has BTN_LEFT, BTN_MIDDLE, BTN_RIGHT but they never send events. The touchpad has BTN_LEFT and BTN_0, BTN_1 and BTN_2 [2]. Clicking the left button on the trackpoint generates BTN_0 on the touchpad device, clicking the right button generates BTN_1 on the touchpad device. So in short, Lenovo has decided to wire the newly re-introduced trackpoint buttons to the touchpad, not the trackpoint. [3] The middle button is currently dead, which is a kernel bug. Meanwhile we think of it as security feature - never accidentally paste your password into your IRC session again!

What does this mean for us? Neither synaptics nor evdev nor libinput currently support this so we've been busy apidae and writing patches like crazy. [Update 19/03/15]: The patches queued in Dmitry's for-linus branch re-route the trackstick buttons in the kernel through the trackstick device. To userspace, the trackstick thus looks the same as in the past and no userspace patches are needed. I've reverted the synaptics patch already, the libinput patch will follow. The patch goes into the kernel and udev.... The two patches needed go into the kernel and udev, and libinput. No, the three patches needed go into the kernel, udev and libinput, and synaptics. The four patches, no, wait. Amongst the projects needing patches are the kernel, udev, libinput and synaptics. I'll try again:

With those put together, things pretty much work as they're supposed to. libinput handles middle button scrolling as well this way but synaptics won't, much for the same reason it didn't work in the previous generation: synaptics can't talk to evdev and vice versa. And given that synaptics is on life support or in pallative care, depending how you look at it, I recommend not holding your breath for a fix. Otherwise you may join it quickly.

Note that all the patches are fresh off the presses and there may be a few bits changing before they are done. If you absolutely can't live without the trackpoint's buttons you can work around it by disabling the synaptics kernel driver until the patches have trickled down to your distribution.

The tracking bug for all this is Bug 88609. Feel free to CC yourself on it. Bring popcorn.

Final note: I haven't seen logs from the T450, T550, ... devices yet yet so this is so far only confirmed on the X1 Carbon so far. Given the hardware is essentially identical I expect it to be true for the rest of the series though.

[1] We also apply quirks for the 2013 generation because the firmware was buggy - a problem Synaptics Inc. has since fixed (but currently gives us slight headaches).
[2] It is also marked with INPUT_PROP_TOPBUTTONPAD which is a bug. It uses a new PNPID but one that was in the range we previously believed was for pads without trackpoint buttons. That's an an easy thing to fix.
[3] The reason for that seems to be HW design: this way they can keep the same case/keyboard and just swap the touchpad bits.
[4] synaptics is old enough to support dedicated scroll buttons. Buttons that used to send BTN_0 and BTN_1 and are thus interpreted as scroll up/down event.

 March 16, 2015
So I've still been working on the virgil3d project along with part time help from Marc-Andre and Gerd at Red Hat, and we've been making steady progress. This post is about a test harness I just finished developing for adding and debugging GL features.

So one of the more annoying issuess with working on virgil has been that while working on adding 3D renderer features or trying to track down a piglit failure, you generally have to run a full VM to do so. This adds a long round trip in your test/development cycle.

I'd always had the idea to do some sort of local system renderer, but there are some issues with calling GL from inside a GL driver. So my plan was to have a renderer process which loads the renderer library that qemu loads, and a mesa driver that hooks into the software rasterizer interfaces. So instead of running llvmpipe or softpipe I have a virpipe gallium wrapper, that wraps my virgl driver and the sw state tracker via a new vtest winsys layer for virgl.

So the virgl pipe driver sits on top of the new winsys layer, and the new winsys instead of using the Linux kernel DRM apis just passes the commands over a UNIX socket to a remote server process.

The remote server process then uses EGL and the renderer library, forks a new copy for each incoming connection and dies off when the rendering is done.

The final rendered result has to be read back over the socket, and then the sw winsys is used to putimage the rendering onto the screen.

So this system is probably going to be slower in raw speed terms, but for developing features or debugging fails it should provide an easier route without the overheads of the qemu process. I was pleasantly surprised it only took two days to pull most of this test harness together which was neat, I'd planned much longer for it!

The code lives in two halves.
http://cgit.freedesktop.org/~airlied/virglrenderer
http://cgit.freedesktop.org/~airlied/mesa virgl-mesa-driver

[updated: pushed into the main branches]

Also the virglrenderer repo is standalone now, it also has a bunch of unit tests in it that are run using valgrind also, in an attempt to lock down some more corners of the API and test for possible ways to escape the host.
 March 06, 2015

Recap

My previous post served as an initial look into Mesa’s GLSL compiler, where we discussed the Mesa IR, which is a core aspect of the compiler. In this post I’ll introduce another relevant aspect: IR lowering.

IR lowering

There are multiple lowering passes implemented in Mesa (check src/glsl/lower_*.cpp for a complete list) but they all share a common denominator: their purpose is to re-write certain constructs in the IR so they fit better the underlying GPU hardware.

In this post we will look into the lower_instructions.cpp lowering pass, which rewrites expression operations that may not be supported directly by GPU hardware with different implementations.

The lowering process involves traversing the IR, identifying the instructions we want to lower and modifying the IR accordingly, which fits well into the visitor pattern strategy discussed in my previous post. In this case, expression lowering is handled by the lower_instructions_visitor class, which implements the lowering pass in the visit_leave() method for ir_expression nodes.

The hierarchical visitor class, which serves as the base class for most visitors in Mesa, defines visit() methods for leaf nodes in the IR tree, and visit_leave()/visit_enter() methods for non-leaf nodes. This way, when traversing intermediary nodes in the IR we can decide to take action as soon as we enter them or when we are about to leave them.

In the case of our lower_instructions_visitor class, the visit_leave() method implementation is a large switch() statement with all the operators that it can lower.

The code in this file lowers common scenarios that are expected to be useful for most GPU drivers, but individual drivers can still select which of these lowering passes they want to use. For this purpose, hardware drivers create instances of the lower_instructions class passing the list of lowering passes to enable. For example, the Intel i965 driver does:

const int bitfield_insert = brw->gen >= 7
? BITFIELD_INSERT_TO_BFM_BFI
: 0;
MOD_TO_FLOOR |
DIV_TO_MUL_RCP |
EXP_TO_EXP2 |
LOG_TO_LOG2 |
bitfield_insert |
LDEXP_TO_ARITH);


Notice how in the case of Intel GPUs, one of the lowering passes is conditionally selected depending on the hardware involved. In this case, brw->gen >= 7 selects GPU generations since IvyBridge.

Let’s have a look at the implementation of some of these lowering passes. For example, SUB_TO_ADD_NEG is a very simple one that transforms subtractions into negative additions:

void
{
ir->operands[1] =
new(ir) ir_expression(ir_unop_neg, ir->operands[1]->type,
ir->operands[1], NULL);
this->progress = true;
}


As we can see, the lowering pass simply changes the operator used by the ir_expression node, and negates the second operand using the unary negate operator (ir_unop_neg), thus, converting the original a = b – c into a = b + (-c).

Of course, if a driver does not have native support for the subtraction operation, it could still do this when it processes the IR to produce native code, but this way Mesa is saving driver developers that work. Also, some lowering passes may enable optimization passes after the lowering that drivers might miss otherwise.

Let’s see a more complex example: MOD_TO_FLOOR. In this case the lowering pass provides an implementation of ir_binop_mod (modulo) for GPUs that don’t have a native modulo operation.

The modulo operation takes two operands (op0, op1) and implements the C equivalent of the ‘op0 % op1‘, that is, it computes the remainder of the division of op0 by op1. To achieve this the lowering pass breaks the modulo operation as mod(op0, op1) = op0 – op1 * floor(op0 / op1), which requires only multiplication, division and subtraction. This is the implementation:

ir_variable *x = new(ir) ir_variable(ir->operands[0]->type, "mod_x",
ir_var_temporary);
ir_variable *y = new(ir) ir_variable(ir->operands[1]->type, "mod_y",
ir_var_temporary);
this->base_ir->insert_before(x);
this->base_ir->insert_before(y);

ir_assignment *const assign_x =
new(ir) ir_assignment(new(ir) ir_dereference_variable(x),
ir->operands[0], NULL);
ir_assignment *const assign_y =
new(ir) ir_assignment(new(ir) ir_dereference_variable(y),
ir->operands[1], NULL);

this->base_ir->insert_before(assign_x);
this->base_ir->insert_before(assign_y);

ir_expression *const div_expr =
new(ir) ir_expression(ir_binop_div, x->type,
new(ir) ir_dereference_variable(x),
new(ir) ir_dereference_variable(y));

/* Don't generate new IR that would need to be lowered in an additional
* pass.
*/
if (lowering(DIV_TO_MUL_RCP) && (ir->type->is_float() ||
ir->type->is_double()))
div_to_mul_rcp(div_expr);

ir_expression *const floor_expr =
new(ir) ir_expression(ir_unop_floor, x->type, div_expr);

if (lowering(DOPS_TO_DFRAC) && ir->type->is_double())
dfloor_to_dfrac(floor_expr);

ir_expression *const mul_expr =
new(ir) ir_expression(ir_binop_mul,
new(ir) ir_dereference_variable(y),
floor_expr);

ir->operation = ir_binop_sub;
ir->operands[0] = new(ir) ir_dereference_variable(x);
ir->operands[1] = mul_expr;
this->progress = true;


Notice how the first thing this does is to assign the operands to a variable. The reason for this is a bit tricky: since we are going to implement ir_binop_mod as op0 – op1 * floor(op0 / op1), we will need to refer to the IR nodes op0 and op1 twice in the tree. However, we can’t just do that directly, for that would mean that we have the same node (that is, the same pointer) linked from two different places in the IR expression tree. That is, we want to have this tree:

                             sub
/     \
op0       mult
/    \
op1     floor
|
div
/   \
op0     op1


                             sub
/     \
|      mult
|     /   \
|   floor  |
|     |    |
|    div   |
|   /   \  |
op0     op1


This second version of the tree is problematic. For example, let’s say that a hypothetical optimization pass detects that op1 is a constant integer with value 1, and realizes that in this case div(op0/op1) == op0. When doing that optimization, our div subtree is removed, and with that, op1 could be removed too (and possibily freed), leaving the other reference to that operand in the IR pointing to an invalid memory location… we have just corrupted our IR:

                             sub
/     \
|      mult
|     /    \
|   floor   op1 [invalid pointer reference]
|     |
|    /
|   /
op0


Instead, what we want to do here is to clone the nodes each time we need a new reference to them in the IR. All IR nodes have a clone() method for this purpose. However, in this particular case, cloning the nodes creates a new problem: op0 and op1 are ir_expression nodes so, for example, op0 could be the expression a + b * c, so cloning the expression would produce suboptimal code where the expression gets replicated. This, at best, will lead to slower
compilation times due to optimization passes needing to detect and fix that, and at worse, that would go undetected by the optimizer and lead to worse performance where we compute the value of the expression multiple times:

                              sub
/        \
/   \       /    \
a     mult  op1     floor
/   \          |
b     c        div
/   \
/   \
a    mult
/    \
b     c


The solution to this problem is to assign the expression to a variable, then dereference that variable (i.e., read its value) wherever we need. Thus, the implementation defines two variables (x, y), assigns op0 and op1 to them and creates new dereference nodes wherever we need to access the value of the op0 and op1 expressions:

                       =               =
/   \           /   \
x     op0       y     op1

sub
/     \
*x       mult
/    \
*y     floor
|
div
/   \
*x     *y


In the diagram above, each variable dereference is marked with an ‘*’, and each one is a new IR node (so both appearances of ‘*x’ refer to different IR nodes, both representing two different reads of the same variable). With this solution we only evaluate the op0 and op1 expressions once (when they get assigned to the corresponding variables) and we never refer to the same IR node twice from different places (since each variable dereference is a new IR node).

Now that we know why we assign these two variables, let’s continue looking at the code of the lowering pass:

In the next step we implement op0 / op1 using a ir_binop_div expression. To speed up compilation, if the driver has the DIV_TO_MUL_RCP lowering pass enabled, which transforms a / b into a * 1 / b (where 1 / b could be a native instruction), we immediately execute the lowering pass for that expression. If we didn’t do this here, the resulting IR would contain a division operation that might have to be lowered in a later pass, making the compilation process slower.

The next step uses a ir_unop_floor expression to compute floor(op0/op1), and again, tests if this operation should be lowered too, which might be the case if the type of the operands is a 64bit double instead of a regular 32bit float, since GPUs may only have a native floor instruction for 32bit floats.

Next, we multiply the result by op1 to get op1 * floor(op0 / op1).

Now we only need to subtract this from op0, which would be the root IR node for this expression. Since we want the new IR subtree spawning from this root node to replace the old implementation, we directly edit the IR node we are lowering to replace the ir_binop_mod operator with ir_binop_sub, make a dereference to op1 in the first operand and link the expression holding op1 * floor(op0 / op1) in the second operand, effectively attaching our new implementation in place of the old version. This is how the original and lowered IRs look like:

Original IR:

[prev inst] -> mod -> [next inst]
/   \
op0     op1


Lowered IR:

[prev inst] -> var x -> var y ->   =   ->   =   ->   sub   -> [next inst]
/ \      / \      /   \
x  op0   y  op1  *x     mult
/    \
*y      floor
|
div
/   \
*x     *y


Finally, we return true to let the compiler know that we have optimized the IR and that as a consequence we have introduced new nodes that may be subject to further lowering passes, so it can run a new pass. For example, the subtraction we just added may be lowered again to a negative addition as we have seen before.

Coming up next

Now that we learnt about lowering passes we can also discuss optimization passes, which are very similar since they are also based on the visitor implementation in Mesa and also transform the Mesa IR in a similar way.

As I said in a previous post, this year I attended FOSDEM to give talk about how to test our OpenGL drivers with Free Software. The slides are available online and the video recording has been recently released.

Last but not least, I would like to thank Igalia for sponsoring my trip

libinput supports edge scrolling since version 0.7.0. Whoops, how does the post title go with this statement? Well, libinput supports edge scrolling, but only on some devices and chances are your touchpad won't be one of them. Bug 89381 is the reference bug here.

First, what is edge scrolling? As the libinput documentation illustrates, it is scrolling triggered by finger movement within specific regions of the touchpad - the left and bottom edges for vertical and horizontal scrolling, respectively. This is in contrast to two-finger scrolling, triggered by a two-finger movement, anywhere on the touchpad. synaptics had edge scrolling since at least 2002, the earliest commit in the repo. Back then we didn't have multitouch-capable touchpads, these days they're the default and you'd be struggling to find one that doesn't support at least two fingers. But back then edge-scrolling was the default, and touchpads even had the markings for those scroll edges painted on.

libinput adds a whole bunch of features to the touchpad driver, but those features make it hard to support edge scrolling. First, libinput has quite smart software button support. Those buttons are usually on the lowest ~10mm of the touchpad. Depending on finger movement and position libinput will send a right button click, movement will be ignored, etc. You can leave one finger in the button area while using another finger on the touchpad to move the pointer. You can press both left and right areas for a middle click. And so on. On many touchpads the vertical travel/physical resistance is enough to trigger a movement every time you click the button, just by your finger's logical center moving.

libinput also has multi-direction scroll support. Traditionally we only sent one scroll event for vertical/horizontal at a time, even going as far as locking the scroll direction. libinput changes this and only requires a initial threshold to start scrolling, after that the caller will get both horizontal and vertical scroll information. The reason is simple: it's context-dependent when horizontal scrolling should be used, so a global toggle to disable doesn't make sense. And libinput's scroll coordinates are much more fine-grained too, which is particularly useful for natural scrolling where you'd expect the content to move with your fingers.

Finally, libinput has smart palm detection. The large majority of palm touches are along the left and right edges of the touchpad and they're usually indistinguishable from finger presses (same pressure values for example). Without palm detection some laptops are unusable (e.g. the T440 series).

These features interfere heavily with edge scrolling. Software button areas are in the same region as the horizontal scroll area, palm presses are in the same region as the vertical edge scroll area. The lower vertical edge scroll zone overlaps with software buttons - and that's where you would put your finger if you'd want to quickly scroll up in a document (or down, for natural scrolling). To support edge scrolling on those touchpads, we'd need heuristics and timeouts to guess when something is a palm, a software button click, a scroll movement, the start of a scroll movement, etc. The heuristics are unreliable, the timeouts reduce responsiveness in the UI. So our decision was to only provide edge scrolling on touchpads where it is required, i.e. those that cannot support two-finger scrolling, those with physical buttons. All other touchpads provide only two-finger scrolling. And we are focusing on making 2 finger scrolling good enough that you don't need/want to use edge scrolling (pls file bugs for anything broken)

Now, before you get too agitated: if edge scrolling is that important to you, invest the time you would otherwise spend sharpening pitchforks, lighting torches and painting picket signs into developing a model that allows us to do reliable edge scrolling in light of all the above, without breaking software buttons, maintaining palm detection. We'd be happy to consider it.

This feature got merged for libinput 0.8 but I noticed I hadn't blogged about it. So belatedly, here is a short description of scroll sources in libinput.

Scrolling is a fairly simple concept. You move the mouse wheel and the content moves down. Beyond that the details get quite nitty, possibly even gritty. On touchpads, scrolling is emulated through a custom finger movement (e.g. two-finger scrolling). A mouse wheel moves in discrete steps of (usually) 15 degrees, a touchpad's finger movement is continuous (within the device physical resolution). Another scroll method is implemented for the pointing stick: holding the middle button down while moving the stick will generate scroll events. Like touchpad scroll events, these events are continuous. I'll ignore natural scrolling in this post because it just inverts the scroll direction. Kinetic scrolling ("fling scrolling") is a comparatively recent feature: when you lift the finger, the final finger speed determines how long the software will keep emulating scroll events. In synaptics, this is done in the driver and causes all sorts of issues - the driver may keep sending scroll events even while you start typing.

In libinput, there is no kinetic scrolling at all, what we have instead are scroll sources. Currently three sources are defined, wheel, finger and continuous. Wheel is obvious, it provides the physical value in degrees (see this post) and in discrete steps. The "finger" source is more interesting, it is the hint provided by libinput that the scroll event is caused by a finger movement on the device. This means that a) there are no discrete steps and b) libinput guarantees a terminating scroll event when the finger is lifted off the device. This enables the caller to implement kinetic scrolling: simply wait for the terminating event and then calculate the most recent speed. More importantly, because the kinetic scrolling implementation is pushed to the caller (who will push it to the client when the Wayland protocol for this is ready), kinetic scrolling can be implemented on a per-widget basis.

Finally, the third source is "continuous". The only big difference to "finger" is that we can't guarantee that the terminating event is sent, simply because we don't know if it will happen. It depends on the implementation. For the caller this means: if you see a terminating scroll event you can use it as kinetic scroll information, otherwise just treat it normally.

For both the finger and the continuous sources the scroll distance provided by libinput is equivalent to "pixels", i.e. the value that the relative motion of the device would otherwise send. This means the caller can interpret this depending on current context too. Long-term, this should make scrolling a much more precise and pleasant experience than the old X approach of "You've scrolled down by one click".

The API documentation for all this is here: http://wayland.freedesktop.org/libinput/doc/latest/group__event__pointer.html, search for anything with "pointer_axis" in it.

 March 03, 2015

Recap

In my last post I explained that modern 3D pipelines are programmable and how this has impacted graphics drivers. In the following posts we will go deeper into this aspect by looking at different parts of Mesa’s GLSL compiler. Specifically, this post will cover the GLSL parser, the Mesa IR and built-in variables and functions.

The GLSL parser

The job of the parser is to process the shader source code string provided via glShaderSource and transform it into a suitable binary representation that is stored in RAM and can be efficiently processed by other parts of the compiler in later stages.

The parser consists of a set of Lex/Yacc rules to process the incoming shader source. The lexer (glsl_parser.ll) takes care of tokenizing the source code and the parser (glsl_parser.yy) adds meaning to the stream of tokens identified in
the lexer stage.

Similarly, just like in C or C++, GLSL includes a pre-processor that goes through the shader source code before the main parser kicks in. Mesa’s implementation of the GLSL pre-processor lives in src/glsl/glcpp and is also based on Lex/Yacc rules.

The output of the parser is an Abstract Syntax Tree (AST) that lives in RAM memory, which is a binary representation of the shader source code. The nodes that make this tree are defined in src/glsl/ast.h.

For someone familiar with all the Lex/Yacc stuff, the parser implementation in Mesa should feel familiar enough.

The next step takes care of converting from the AST to a different representation that is better suited for the kind of operations that drivers will have to do with it. This new representation, called the IR (Intermediate Representation), is usually referenced in Mesa as Mesa IR, GLSL IR or simply HIR.

The AST to Mesa IR conversion is driven by the code in src/glsl/ast_to_hir.cpp.

Mesa IR

The Mesa IR is the main data structure used in the compiler. Most of the work that the compiler does can be summarized as:

• Optimizations in the IR
• Modifications in the IR for better/easier integration with GPU hardware
• Generating native assembly code for the target GPU from the IR

As we can see, the Mesa IR is at the core of all the work that the compiler has to do, so understanding how it is setup is necessary to work in this part of Mesa.

The nodes in the Mesa IR tree are defined in src/glsl/ir.h. Let’s have a look at the most important ones:

At the top of the class hierarchy for the IR nodes we have exec_node, which is Mesa’s way of linking independent instructions together in a list to make a program. This means that each instruction has previous and next pointers to the instructions that are before and after it respectively. So, we have ir_instruction, the base class for all nodes in the tree, inherit from exec_node.

Another important node is ir_rvalue, which is the base class used to represent expressions. Generally, anything that can go on the right side of an assignment is an ir_rvalue. Subclasses of ir_rvalue include ir_expression, used to represent all kinds of unary, binary or ternary operations (supported operators are defined in the ir_expression_operation enumeration), ir_texture, which is used to represent texture operations like a texture lookup, ir_swizzle, which is used for swizzling values in vectors, all the ir_dereference nodes, used to access the values stored in variables, arrays, structs, etc. and ir_constant, used to represent constants of all basic types (bool, float, integer, etc).

We also have ir_variable, which represents variables in the shader code. Notice that the definition of ir_variable is quite large… in fact, this is by large the node with the most impact in the memory footprint of the compiler when compiling shaders in large games/applications. Also notice that the IR differentiates between variables and variable dereferences (the fact of looking into a variable’s value), which are represented as an ir_rvalue.

Similarly, the IR also defines nodes for other language constructs like ir_loop, ir_if, ir_assignment, etc.

Debugging the IR is not easy, since the representation of a shader program in IR nodes can be quite complex to traverse and inspect with a debugger. To help with this Mesa provides means to print the IR to a human-readable text format. We can enable this by using the environment variable MESA_GLSL=dump. This will instruct Mesa to print both the original shader source code and its IR representation. For example:

\$ MESA_GLSL=dump ./test_program

GLSL source for vertex shader 1:
#version 140
#extension GL_ARB_explicit_attrib_location : enable

layout(location = 0) in vec3 inVertexPosition;
layout(location = 1) in vec3 inVertexColor;

uniform mat4 MVP;
smooth out vec3 out0;

void main()
{
gl_Position = MVP * vec4(inVertexPosition, 1);
out0 = inVertexColor;
}

(
(declare (sys ) int gl_InstanceID)
(declare (sys ) int gl_VertexID)
(declare (shader_out ) (array float 0) gl_ClipDistance)
(declare (uniform ) (array vec4 56) gl_CurrentAttribFragMESA)
(declare (uniform ) (array vec4 33) gl_CurrentAttribVertMESA)
(declare (uniform ) gl_DepthRangeParameters gl_DepthRange)
(declare (uniform ) int gl_NumSamples)
(declare () int gl_MaxVaryingComponents)
(declare () int gl_MaxClipDistances)
(declare () int gl_MaxFragmentUniformComponents)
(declare () int gl_MaxVaryingFloats)
(declare () int gl_MaxVertexUniformComponents)
(declare () int gl_MaxDrawBuffers)
(declare () int gl_MaxTextureImageUnits)
(declare () int gl_MaxCombinedTextureImageUnits)
(declare () int gl_MaxVertexTextureImageUnits)
(declare () int gl_MaxVertexAttribs)
(declare (uniform ) mat4 MVP)
(function main
(signature void
(parameters
)
(
(declare (temporary ) vec4 vec_ctor)
(assign  (w) (var_ref vec_ctor)  (constant float (1.000000)) )
(assign  (xyz) (var_ref vec_ctor)  (var_ref inVertexPosition) )
(assign  (xyzw) (var_ref gl_Position)
(expression vec4 * (var_ref MVP) (var_ref vec_ctor) ) )
(assign  (xyz) (var_ref out0)  (var_ref inVertexColor) )
))
)
)


Notice, however, that the IR representation we get is not the one that is produced by the parser. As we will see later, that initial IR will be modified in multiple ways by Mesa, for example by adding different kinds of optimizations, so the IR that we see is the result after all these processing passes over the original IR. Mesa refers to this post-processed version of the IR as LIR (low-level IR) and to the initial version of the IR as produced by the parser as HIR (high-level IR). If we want to print the HIR (or any intermediary version of the IR as it transforms into the final LIR), we can edit the compiler and add calls to _mesa_print_ir as needed.

Traversing the Mesa IR

We mentioned before that some of the compiler’s work (a big part, in fact) has to do with optimizations and modifications of the IR. This means that the compiler needs to traverse the IR tree and identify subtrees that are relevant to this kind of operations. To achieve this, Mesa uses the visitor design pattern.

Basically, the idea is that we have a visitor object that can traverse the IR tree and we can define the behavior we want to execute when it finds specific nodes.

For instance, there is a very simple example of this in src/glsl/linker.cpp: find_deref_visitor, which detects if a variable is ever read. This involves traversing the IR, identifying ir_dereference_variable nodes (the ones where a variable’s value is accessed) and check if the name of that variable matches the one we are looking for. Here is the visitor class definition:

/**
* Visitor that determines whether or not a variable is ever read.
*/
class find_deref_visitor : public ir_hierarchical_visitor {
public:
find_deref_visitor(const char *name)
: name(name), found(false)
{
/* empty */
}

virtual ir_visitor_status visit(ir_dereference_variable *ir)
{
if (strcmp(this->name, ir->var->name) == 0) {
this->found = true;
return visit_stop;
}

return visit_continue;
}

bool variable_found() const
{
return this->found;
}

private:
const char *name;       /**< Find writes to a variable with this name. */
bool found;             /**< Was a write to the variable found? */
};


And this is how we get to use this, for example to check if the shader code ever reads gl_Vertex:

find_deref_visitor find("gl_Vertex");
find.run(sh->ir);
if (find.variable_found()) {
(...)
}


Most optimization and lowering passes in Mesa are implemented as visitors and follow a similar idea. We will look at examples of these in a later post.

Built-in variables and functions

GLSL defines a set of built-in variables (with ‘gl_’ prefix) for each shader stage which Mesa injects into the shader code automatically. If you look at the example where we used MESA_GLSL=dump to obtain the generated Mesa IR you can see some of these variables.

Mesa implements support for built-in variables in _mesa_glsl_initialize_variables(), defined in src/glsl/builtin_variables.cpp.

Notice that some of these variables are common to all shader stages, while some are specific to particular stages or available only in specific versions of GLSL.

Depending on the type of variable, Mesa or the hardware driver may be able to provide the value immediately (for example for variables holding constant values like gl_MaxVertexAttribs or gl_MaxDrawBuffers). Otherwise, the driver will probably have to fetch (or generate) the value for the variable from the hardware at program run-time by generating native code that is added to the user program. For example, a geometry shader that uses gl_PrimitiveID will need that variable updated for each primitive processed by the Geometry Shader unit in a draw call. To achieve this, a driver might have to generate native code that fetches the current primitive ID value from the hardware and puts stores it in the register that provides the storage for the gl_PrimitveID variable before the user code is executed.

The GLSL language also defines a number of available built-in functions that must be provided by implementators, like texture(), mix(), or dot(), to name a few examples. The entry point in Mesa’s GLSL compiler for built-in functions
is src/glsl/builtin_functions.cpp.

The method builtin_builder::create_builtins() takes care of registering built-in functions, and just like with built-in variables, not all functions are always available: some functions may only be available in certain shading units, others may only be available in certain GLSL versions, etc. For that purpose, each built-in function is registered with a predicate that can be used to test if that function is at all available in a specific scenario.

Built-in functions are registered by calling the add_function() method, which registers all versions of a specific function. For example mix() for float, vec2, vec3, vec4, etc Each of these versions has its own availability predicate. For instance, mix() is always available for float arguments, but using it with integers requires GLSL 1.30 and the EXT_shader_integer_mix extension.

Besides the availability predicate, add_function() also takes an ir_function_signature, which tells Mesa about the specific signature of the function being registered. Notice that when Mesa creates signatures for the functions it also defines the function body. For example, the following code snippet defines the signature for modf():

ir_function_signature *
builtin_builder::_modf(builtin_available_predicate avail,
const glsl_type *type)
{
ir_variable *x = in_var(type, "x");
ir_variable *i = out_var(type, "i");
MAKE_SIG(type, avail, 2, x, i);

ir_variable *t = body.make_temp(type, "t");
body.emit(assign(t, expr(ir_unop_trunc, x)));
body.emit(assign(i, t));
body.emit(ret(sub(x, t)));

return sig;
}


GLSL’s modf() splits a number in its integer and fractional parts. It assigns the integer part to an output parameter and the function return value is the fractional part.

This signature we see above defines input parameter ‘x’ of type ‘type’ (the number we want to split), an output parameter ‘i’ of the same type (which will hold the integer part of ‘x’) and a return type ‘type’.

The function implementation is based on the existence of the unary operator ir_unop_trunc, which can take a number and extract its integer part. Then it computes the fractional part by subtracting that from the original number.

When the modf() built-in function is used, the call will be expanded to include this IR code, which will later be transformed into native code for the GPU by the corresponding hardware driver. In this case, it means that the hardware driver is expected to provide an implementation of the ir_unop_trunc operator, for example, which in the case of the Intel i965 driver is implemented as a single hardware instruction (see brw_vec4_visitor.cpp or brw_fs_visitor.cpp
in src/mesa/drivers/dri/i965).

In some cases, the implementation of a built-in function can’t be defined at the IR level. In this case the implementation simply emits an ad-hoc IR node that drivers can identify and expand appropriately. An example of this is EmitVertex() in a geometry shader. This is not really a function call in the traditional sense, but a way to signal the driver that we have defined all the attributes of a vertex and it is time to “push” that vertex into the current primitive. The meaning of “pushing the vertex” is something that can’t be defined at the IR level because it will be different for each driver/hardware. Because of that, the built-in function simply injects an IR node ir_emit_vertex that drivers can identify and implement properly when the time comes. In the case of the Intel code, pushing a vertex involves a number of steps that are very intertwined with the hardware, but it basically amounts to generating native code that implements the behavior that the hardware expects for that to happen. If you are curious, the implementation of this in the i965 driver code can be found in brw_vec4_gs_visitor.cpp, in the visit() method that takes an ir_emit_vertex IR node as parameter.

Coming up next

In this post we discussed the parser, which is the entry point for the compiler, and introduced the Mesa IR, the main data structure. In following posts we will delve deeper into the GLSL compiler implementation. Specifically, we will look into the lowering and optimization passes as well as the linking process and the hooks for hardware drivers that deal with native code generation.

 February 28, 2015

Work continues on Typhon. I've recently yearned for a way to study the Monte-level call stacks for profiling feedback. After a bit of work, I think that I've built some things that will help me.

My initial desire was to get the venerable perf to work with Typhon. perf's output is easy to understand, with a little practice, and describes performance problems pretty well.

I'm going to combine this with Brendan Gregg's very cool flame graph system for visually summarizing call stacks, in order to show off how the profiling information is being interpreted. I like flame graphs and they were definitely a goal of this effort.

Maybe perf doesn't need any help. I was able to use it to isolate some Typhon problems last week. I'll use my Monte webserver for all of these tests, putting it under some stress and then looking at the traces and flame graphs.

Now seems like a good time to mention that my dorky little webserver is not production-ready; it is literally just good enough to respond to Firefox, siege, and httperf with a 200 OK and a couple bytes of greeting. This is definitely a microbenchmark.

With that said, let's look at what perf and flame graphs say about webserver performance:

You can zoom in on this by clicking. Not that it'll help much. This flame graph has two big problems:

1. Most of the time is spent in the mysterious "[unknown]" frames. I bet that those are just caused by the JIT's code generation, but perf doesn't know that they're meaningful or how to label them.
2. The combination of JIT and builtin objects with builtin methods result in totally misleading call stacks, because most object calls don't result in new methods being added to the stack.

I decided to tackle the first problem first, because it seemed easier. Digging a bit, I found a way to generate information on JIT-created code objects and get that information to perf via a temporary file.

The technique is only documented via tribal knowledge and arcane blog entries. (I suppose that, in this regard, I am not helping.) It is described both in this kernel patch implementing the feature, and also in this V8 patch. The Typhon JIT hooks show off my implementation of it.

So, does it work? What does it look like?

I didn't upload a picture of this run, because it doesn't look different from the earlier graph! The big [unknown] frames aren't improved at all. Sufficient digging will reveal the specific newly-annotated frames being nearly never called. Clearly this was not a winning approach.

At this point, I decided to completely change my tack. I wrote a completely new call stack tracer inside Typhon. I wanted to do a sampling profiler, but sampling is hard in RPython. The vmprof project might fix that someday. For now, I'll have to do a precise profiler.

Unlabeled HTTP server profile with correct atoms

I omitted the coding montage. Every time a call is made from within SmallCaps, the profiler takes a time measurement before and after the call. This is pretty great! But can we get more useful names?

Names in Monte are different from names in, say, Python or Java. Python and Java both have class names. Since Monte does not have classes, Monte doesn't have a class name. A compromise which we accept here is to use the "display name" of the class, which will be the pattern used to bind a user-level object literal, and will be the name of the class for all of the runtime object classes. This is acceptable.

HTTP server profile with correct atoms and useful display names

Note how the graphs are differently shaped; all of the frames are being split out properly and the graph is more detailed as a result. The JIT is still active during this entire venture, and it'd be cool to see what the JIT is doing. We can use RPython's rpython.rlib.jit.we_are_jitted() function to mark methods as being JIT'd, and we can ask the flame graph generator to colorize them.

HTTP server profile with JIT COLORS HOLY FUCK I CANNOT BELIEVE IT WORKS

Oh man! This is looking pretty cool. Let's colorize the frames that are able to sit directly below JIT entry points. I do this with a heuristic (regular expression).

THE COLORS NEVER STOP, CAN'T STOP, WON'T STOP

This isn't even close to the kind of precision and detail from the amazing Java-on-Illumos profiles on Gregg's site, but it's more than enough to help my profiling efforts.

 February 26, 2015
I recently purchased a 64GB mini SD card to slot in to my laptop and/or tablet, keeping media separate from my home directory pretty full of kernel sources.

This Samsung card looked fast enough, and at 25€ include shipping, seemed good enough value.

Hmm, no mention of the SD card size?

The packaging looked rather bare, and with no mention of the card's size. I opened up the packaging, and looked over the card.

What made it weirder is that it says "made in Taiwan", rather than "Made in Korea" or "Made in China/PRC". Samsung apparently makes some cards in Taiwan, I've learnt, but I didn't know that before getting suspicious.

After modifying gnome-multiwriter's fake flash checker, I tested the card, and sure enough, it's an 8GB card, with its firmware modified to show up as 67GB (67GB!). The device (identified through the serial number) is apparently well-known in swindler realms.

Buyer beware, do not buy from "carte sd" on Amazon.fr, and always check for fake flash memory using F3 or h2testw, until udisks gets support for this.

Amazon were prompt in reimbursing me, but the Comité national anti-contrefaçon and Samsung were completely uninterested in pursuing this further.

In short:

• Test the storage hardware you receive
• Don't buy hardware from Damien Racaud from Chaumont, the person behind the "carte sd" seller account
 February 23, 2015

Some years ago I bought myself a new laptop, deleted the windows partition and installed Fedora on the system. Only to later realize that the system had a bug that required a BIOS update to fix and that the only tool for doing such updates was available for Windows only. And while some tools and methods have been available from a subset of vendors, BIOS updates on Linux has always been somewhat of hit and miss situation. Well luckily it seems that we will finally get a proper solution to this problem.
Peter Jones, who is Red Hat’s representative to the UEFI working group and who is working on making sure we got everything needed to support this on Linux, approached me some time ago to let me know of the latest incoming update to the UEFI standard which provides a mechanism for doing BIOS updates. Which means that any system that supports UEFI 2.5 will in theory be one where we can initiate the BIOS update from Linux. So systems supporting this version of the UEFI specs is expected to become available through the course of this year and if you are lucky your hardware vendor might even provide a BIOS update bringing UEFI 2.5 support to your existing hardware, although you would of course need to do that one BIOS update in the old way.

So with Peter’s help we got hold of some prototype hardware from our friends at Intel which already got UEFI 2.5 support. This hardware is currently in the hands of Richard Hughes. Richard will be working on incorporating the use of this functionality into GNOME Software, so that you can do any needed BIOS updates through GNOME Software along with all your other software update needs.

Peter and Richard will as part of this be working to define a specification/guideline for hardware vendors for how they can make their BIOS updates available in a manner we can consume and automatically check for updates. We will try to align ourselves with the requirements from Microsoft in this area to allow the vendors to either use the exact same package for both Windows and Linux or at least only need small changes to them. We can hopefully get this specification up on freedesktop.org for wider consumption once its done.

I am also already speaking with a couple of hardware vendors to see if we can pilot this functionality with them, to both encourage them to support UEFI 2.5 as quickly as possible and also work with them to figure out the finer details of how to make the updates available in a easily consumable fashion.

Our hope here is that you eventually can get almost any hardware and know that if you ever need a BIOS update you can just fire up Software and it will tell you what if any BIOS updates are available for your hardware, and then let you download and install them. For people running Fedora servers we have had some initial discussions about doing BIOS updates through Cockpit, in addition of course to the command line tools that Peter is writing for this.

I mentioned in an earlier blog post that one of our goals with the Fedora Workstation is to drain the swamp in terms of fixing the real issues making using a Linux desktop challenging, well this is another piece of that puzzle and I am really glad we had Peter working with the UEFI standards group to ensure the final specification was useful also for Linux users.

Anyway as soon as I got some data on concrete hardware that will support this I will make sure to let you know.

 February 22, 2015
So, I finally figured out the bug that was causing some incorrect rendering in xonotic (and, it turns out, to be the same bug plaguing a lot of other games/webgl/etc).  The fix is pushed to upstream mesa master (and I guess I should probably push it to the 10.5 stable branch too).  Now that xonotic renders correctly, I think I can finally call freedreno a4xx support usable:

Also, for fun, a little comparison between the ifc6540 board (snapdragon 805, aka apq8084), and my laptop (i5-4310U).  Both have 1920x1080 resolution, both running gnome-shell and firefox (with identical settings).  Laptop is fedora f21 while ifc6540 is rawhide), but it is quite close to an apples-to-apples comparision:

Obviously not a rigorous benchmark, so please don't read too much into the results.  The intel is still faster overall (as it should be at it's size/price/power budget), but amazing that the gap is becoming so small between something that can be squeezed into a cell phone and dedicated laptop class chips.