planet.freedesktop.org
February 19, 2024

Hi! February is FOSDEM month, and as usual I’ve come to Brussels to meet with a lot of other FOSS developers and exchange ideas. I like to navigate between the buildings and along the hallways to find nice people to talk with. For this edition I’ve been involved in the new modern e-mail devroom, and I’ve given a talk about IMAP with Damian, a fellow IMAP library maintainer and organizer of this devroom. The whole weekend was great!

In wlroots news, I’ve worked on multi-connector atomic commits. Right now, wlroots sequentially configures outputs, one at a time. This is slow and makes it impossible to properly handle GPU limitations such as bandwidth: if the GPU cannot drive two outputs with a 4k resolution, we’ll only find out after the first one has been lit up. As a result we can’t properly implement fallbacks and this results in black screens on some setups. In particular, on Intel some users need to set WLR_DRM_NO_MODIFIERS=1 to have their multi-output setup work correctly. The multi-connector atomic commit work is the first step to resolve these situations and also results in faster modesets. The second step will be to add fallback logic to use a less bandwidth-intensive scanout buffer on modeset.
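To illustrate why testing all outputs together matters, here is a toy model in Python. It is not wlroots code: the output names, the bandwidth costs, the `BUDGET` constant and both function names are invented for illustration, and real GPU limits are far more complicated than a single number.

```python
from itertools import product

# Toy model only: costs and budget are invented numbers, not real GPU limits.
BUDGET = 10
# Candidate configurations per output, ordered from preferred to fallback.
# Each entry is (label, bandwidth cost).
CANDIDATES = {
    "DP-1": [("4k, preferred buffer", 7), ("4k, cheaper buffer", 5)],
    "DP-2": [("4k, preferred buffer", 7), ("4k, cheaper buffer", 5)],
}


def sequential_commit(candidates, budget):
    """Configure outputs one at a time, always picking the preferred option."""
    used, lit = 0, {}
    for name, options in candidates.items():
        label, cost = options[0]
        if used + cost > budget:
            lit[name] = None  # too late to revisit earlier outputs
        else:
            lit[name] = label
            used += cost
    return lit


def atomic_commit(candidates, budget):
    """Test whole combinations up front and return the first one that fits."""
    names = list(candidates)
    for combo in product(*(candidates[n] for n in names)):
        if sum(cost for _, cost in combo) <= budget:
            return {n: label for n, (label, _) in zip(names, combo)}
    return None


print(sequential_commit(CANDIDATES, BUDGET))
print(atomic_commit(CANDIDATES, BUDGET))
```

In this toy setup the sequential approach leaves the second output dark, while the all-at-once test can settle on a cheaper configuration for both. The fallback search here only gestures at the "second step" described above.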

While working on the wlroots DRM backend code, I’ve also taken the opportunity to clean up the internals and skip unnecessary modesets when switching between VTs. Ctrl+Alt+1 should be faster now! I’ve also tried to resurrect the ext-screencopy-v1 protocol, required for capturing individual windows. I’ve pushed a new version and reworked the wlroots implementation; hopefully I can find some more time next month to continue on this front.

Sway 1.9-rc4 was recently released, and my reading of the tea leaves at my disposal indicates that the final release may ship soon. Sway 1.9 will leverage the new wlroots rendering API; however, it does not include the huge scene-graph rework that Alexander has pushed forward in the last year or so. Sway 1.10 will be the first release to include this major overhaul and all the niceties it unlocks. Sway 1.10 will also finally support input method popups (used for CJK among other things) thanks to efforts by Access and Tadeo Kondrak.

The NPotM is sinwon, a simple OAuth 2 server for small deployments. I’ve long been trying to find a good solution to delegate authentication to a single service and provide single-sign-on for my personal servers. I’ve come to like OAuth 2 because it’s a standard, it’s not tied to another use-case (like IMAP or SMTP are), and it prevents other services from manipulating user passwords directly. sinwon stores everything in a SQLite database, and it’s pretty boring: no fancy cryptography usage for tokens, no fancy cloud-grade features. I like boring. sinwon has a simple UI to manage users and OAuth clients (sometimes called “apps”). Still missing are refresh tokens, OAuth scopes, an audit log, personal access tokens, and more advanced features such as TOTP, device authorization grants and mTLS. Patches welcome!

I’ve continued my work to make it easier to contribute to the SourceHut codebase. Setting up PGP keys is now optional to run a SourceHut instance, and a local S3-compatible server (such as minio) can be used without TLS. Thorben Günther has added paste.sr.ht to sr.ht-container-compose. I’m also working on making services use meta.sr.ht’s GraphQL API instead of maintaining their own copy of the user’s profile, but more needs to be done there.

And now for the random collection of smaller updates… The soju IRC bouncer and the goguma IRC client for mobile devices now support file uploads: no need to use an external service anymore to share a screenshot or picture in an IRC conversation. Conrad Hoffmann and Thomas Müller have added support for multiple address books to the go-webdav library, as well as creating/deleting address books and calendars. I’ve modernized the FreeDesktop e-mail server setup with SPF, DKIM and DMARC. KDE developers have contributed a new layer-shell minor version to support docking their panel to a corner of the screen.

That’s all for now, see you next month!

February 16, 2024

In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to well on my way towards becoming fit, while putting in a minimum of effort. On the fitness side, I’ve taken my cardiorespiratory fitness from below average to above average, and I’m visibly stronger (I can do multiple pull-ups!). Again, I’ve aimed to do so with minimal effort to maximize my efficiency.

Here’s what I wrote in my prior post on weight loss:

I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. And a year ago, my blood pressure was already at pre-hypertension levels, despite being at a relatively young age.

Research shows that 5 factors are key to a long life — extending your life by 12–14 years:

  • Never smoking
  • BMI of 18.5–24.9
  • 30+ min a day of moderate/vigorous exercise
  • Moderate alcohol intake (vs none, occasional, or heavy)
    • Unsurprisingly, there is vigorous scientific and philosophical/religious/moral debate about this one; however, all studies agree that heavy drinking is bad.
  • Diet quality in the upper 40% (Alternate Healthy Eating Index)

In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending.

Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. I do drink moderately, however (almost entirely beer).

This post accompanies my earlier writeup, “The lazy technologist’s guide to weight loss.” Check that out for an in-depth, science-driven review of my experience losing weight. 

Why is this the lazy technologist’s guide, again? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.”

What’s the lowest-effort, most research-driven way to become fit as quickly as possible, during and after losing weight? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path.

Cardio Fitness

My initial goal for fitness was simply to meet the “30+ min/day” factor in the research study I cited at the beginning of this post, while considering a few factors:

  • First, this is intended to be the lazy way, so there should be no long and intense workouts unless unavoidable. 
  • Second, I did not want to buy a bunch of equipment or need to pay for a gym membership. Any required equipment should be inexpensive and small.
  • Third, I wanted to avoid creating any joint issues that would affect me negatively later in life. I was particularly concerned about high-impact, repetitive stress from running on hard surfaces, which I’d heard could be problematic.

Joint issues become very common for older people, especially in the knees and hips. My program needed to avoid any high-impact, repetitive stress on those joints to preserve maximum function. I’ve always heard that running is bad for your knees, but after I looked into it, the research does not bear that out. And yet, it remains a popular misconception among both the general population and doctors who do not frequently perform hip replacements.

However, I just don’t like running — I enjoy different activities if I’m going to be working hard physically, such as games like racquetball/squash/pickleball or self-defense (Krav Maga!). I’m also not a big fan of getting all sweaty in general, but especially in the middle of a workday. So I wanted an activity with a moderate rather than high level of exertion.

Low-impact options include walking, cycling, swimming, and rowing, among others. But swimming requires an indoor pool or year-round good weather, and rowing requires a specialized machine or boat, while I’m aiming to stay minimal. I also do not own a bicycle, nor is the snowy weather in Minnesota great for cycling in the winter (fat-tire bikes being an exception).

We’re left with walking as the primary activity. 

LISS — Low-Intensity Steady State

Initially, I started with only walking. This is called low-intensity steady state (LISS) cardio (cardiovascular, a.k.a. aerobic) exercise. Later, I also incorporated high-intensity interval training (HIIT) as the laziest possible way to further improve my cardiovascular health.

To bump walking up into a “moderate” level of activity, I need to walk between 3–4 mph. This is what’s sometimes called a “brisk” walk — 3 mph feels fast, and 4 mph is about as fast as I can go without changing into some weird competitive walking style.

I also need to hit 30+ minutes per day of this brisk walking. At first, I started on a “walking pad” treadmill under my standing desk, which I bought for <$200 on Amazon. My goal was to integrate walking directly into my day with no dedicated time, and this seemed like a good path. However, this violates the minimalism requirement. I also learned that the pace is too fast to do much of anything at the desk besides watching videos or browsing social media. So I broke this up into two 1-mile outdoor walks, one after lunch and another after dinner.

Each 1-mile walk takes 15–20 minutes. Fitting this into a workday requires me to block off 45–60 minutes for lunch, between lunch prep, time to eat, and the walk itself. I find this much easier than trying to create a huge block of time in the morning for exercise, because I do not naturally wake up early. In the evening, I’ll frequently extend the after-dinner walk to ~2 miles instead of 1 mile.
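The 15–20 minute figure follows directly from the 3–4 mph brisk-walking pace mentioned earlier. A minimal sketch of that arithmetic (the function name is just illustrative):

```python
def minutes_per_mile(speed_mph: float) -> float:
    """Time in minutes to cover one mile at a constant speed."""
    return 60 / speed_mph


for speed in (3, 4):
    print(f"{speed} mph -> {minutes_per_mile(speed):.0f} min per mile")
```

Two such walks add up to 30–40 minutes, which is enough to clear the 30+ min/day target.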

It turns out that walking after meals is a great strategy for both weight loss and suppressing your blood sugar levels, among other benefits. This can be as short as a 2-minute walk, according to recent studies. In fact, it’s seen as so key in Mediterranean culture that walking is considered a component of the Mediterranean diet.

Overall, I’ve increased my active calorie consumption by 250 calories/day by incorporating active walks into my day. That’s a combination of the 2 after-meal brisk walks, plus a more relaxed walk on my under-desk treadmill sometime during the day. The latter is typically a 2 mph walk for 40–60 min, and I do it while I’m in a meeting that I’m not leading, or maybe watching a webinar. Without buying the walking pad, you could do the same on a nice outdoor walk with a headset or earbuds, but Minnesota weather sometimes makes that miserable. Overall, all of this typically gets me somewhere between 10,000–15,000 steps per day. 

Not only is this good for fitness, it also helps to offset the effects of metabolic adaptation. If you’re losing weight, your body consumes fewer calories because it decreases your resting metabolic rate to conserve energy. Although some sites will suggest this could be hundreds of calories daily, which is quite discouraging, research shows that’s exaggerated for most people. During active weight loss, it’s typically ~100 calories per day, although it may be up to 175±150 calories for diet-resistant people. The ±150 is one standard deviation, so people who are in the worst ~15% of the diet-resistant subset could have adaptations >325 calories/day. So if you believe you’re diet-resistant, you probably want to aim for a 1000-calorie deficit, to ensure you’re able to lose weight at a good rate. On the bright side, that adaptation gets cut in half once you’ve stabilized for a few weeks at your new weight, and it’s effectively back to zero a year later.
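Here is a quick way to sanity-check the "worst ~15%" figure. This sketch assumes the adaptation among diet-resistant people is roughly normally distributed, which the text does not state; the variable names are only illustrative.

```python
from statistics import NormalDist

# Figures from the text: mean 175 calories/day, standard deviation 150.
# Assumption: the adaptation is approximately normally distributed.
adaptation = NormalDist(mu=175, sigma=150)

threshold = 175 + 150  # one standard deviation above the mean
share_above = 1 - adaptation.cdf(threshold)

print(f"Share above {threshold} calories/day: {share_above:.1%}")
```

Under that assumption, the share above one standard deviation comes out a little under 16%, in line with the rough figure quoted above.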

To further maintain my muscle following weight loss, I added a weighted vest to my after-lunch walks occasionally (examples: Rogue, 5.11, TRX). I started doing this once a week, and I aim to get to 3x+/week. I use a 40 lb weighted vest to counterbalance the 40+ lb of weight that I’ve lost. When I walk with the vest, I’m careful to maintain the same pace as without the vest, which increases the intensity and my heart rate. This pushes a normal moderate-intensity walk into the low end of high intensity (approaching 80% of my max heart rate). I also anticipate incorporating this weighted vest into my strength training later, once my own body weight is insufficient for continued progression. 

Considering a minimalist approach, however, I think you could do just fine without a weighted vest. There are other ways to increase intensity, such as speed or inclines, and the combination of a high-protein diet, HIIT, and strength training provides similar benefits.

HIIT — High-Intensity Interval Training

Why do HIIT? Regularly getting your heart rate close to its maximum is good for your cardiovascular health, and you can’t do it with LISS, which by definition is low intensity. Another option besides HIIT is much longer moderate-intensity continuous training (your classic aerobic workout), but HIIT can fit the same benefits or more into a fraction of the time.

Research strongly supports HIIT over longer aerobic workouts, which lets you compress the total workout from the classic 60 minutes down to 30 minutes or less.

However, 30 minutes still isn’t the least you can do and still get most of the benefits. The minimum required HIIT remains unclear — in overall length, weekly frequency, as well as patterns of high-intensity and rest / low-intensity. Here are some examples of research that test the limits of minimalist HIIT and find that it still works well:

Yes, you read that right — the last study used 20-second intervals. They were only separated by 10 seconds of rest, so the primary exercise period was just 4 minutes, excluding warm-up. Furthermore, this meta-analysis suggests that HIIT benefits more from increasing the intensity of the high-intensity intervals, rather than increasing the volume of repetitions.

After my investigation, it was clear that “low-volume” or “extremely low volume” HIIT could work well, so there was no need to do the full 30-minute HIIT workouts that are popular with many gym chains. 

I settled on 3 minutes of HIIT, 2x/week: 3 repetitions of 30 seconds hard / 30 seconds light, plus a 1-minute warm-up. This overlaps with the HIIT intervals, breaks, and repetitions from the research I’ve dug into, and it also has the convenient benefit of not quite making me sweat during the workout, so I don’t need to change clothes. 
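For anyone who wants to turn this pattern into a timer, here is a small sketch that lays out the session as a list of phases. The function name, phase labels and default values are illustrative choices that mirror the numbers above.

```python
def hiit_schedule(reps=3, hard_s=30, light_s=30, warmup_s=60):
    """Return a list of (phase, seconds) tuples for one session."""
    schedule = [("warm-up", warmup_s)]
    for _ in range(reps):
        schedule.append(("hard", hard_s))
        schedule.append(("light", light_s))
    return schedule


session = hiit_schedule()
interval_s = sum(s for phase, s in session if phase != "warm-up")
total_s = sum(s for _, s in session)

for phase, seconds in session:
    print(f"{phase:8s} {seconds:3d} s")
print(f"intervals: {interval_s / 60:.0f} min, total: {total_s / 60:.0f} min")
```

The intervals add up to 3 minutes, or 4 minutes including the warm-up. Changing the keyword arguments lets you try other patterns from the research.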

I’m seeing the benefits of this already, which I’ll discuss in the Summary.

Strength Training

I also wanted to incorporate strength training for many reasons. In the short term, it was to minimize muscle loss as I lost weight (addressed in my prior post). In the medium and long term, I want to build muscle now so that I can live a healthier life once I’m older and also feel better about myself today.

What I’ve found is that aiming for the range of 10%–15% body fat is ideal for men who want to be very fit. This range makes it easy to tell visually when you’re at the top or bottom of the range, based on the appearance of a well-defined six-pack or its fading away to barely visible. It gets harder to tell where you are visually from 15% upwards, while anything below 10% has some health risks and starts to look pretty unusual too.

Within that 10%–15% range, I’m planning to do occasional short-term “lean bulks” / “clean bulks” and “cuts.” That’s the typical approach to building muscle — you eat a slight excess of calories while ensuring plenty of protein, aiming to gain about 2–4 lbs/month for someone my size. After a cycle of doing this, you then “cut” by dieting to lose the excess fat you’ve gained, because it’s impossible to only gain muscle. My personal preference is to make this cycle more agile with shorter iteration cycles, compared to some of the examples I’ve seen. I’m thinking about a 3:1 bulk:cut split over 4 months that results in a total gain/loss of ~10 lbs.

Calisthenics (bodyweight exercises): the minimalist’s approach

My goal of staying minimal pushed me toward calisthenics (bodyweight exercises), rather than needing to work out at a gym or buy free weights. This means the only required equipment is a doorway pull-up bar ($25), while everything else can be done with a wall, table or chair/bench. Although I may not build enormous muscles, it’s possible to get to the point of lifting your entire body weight with a single arm, which is more than good enough for me. That’s effectively lifting 2x your body weight, since you’re lifting 1x with just one arm.

My routine is inspired by Reddit’s r/bodyweightfitness (including the Recommended Routine and the Minimalist Routine) and this blog post by Steven Low, author of the book “Overcoming Gravity.” I’ve also incorporated scientific research wherever possible to guide repetitions and frequency. Overall, the goal is to get both horizontal and vertical pushing and pulling exercises for the arms/shoulders due to their larger range of motion, while getting push and pull for legs, and good core exercises that cover both the upper and lower back as well. 

I’ve chosen compound exercises that work many muscles simultaneously — for practicality (more applicable to real-world motions), length of workout, and minimal equipment needs. If you’re working isolated muscles, you generally need lots of specialized machines at a gym. Isometrics (exercises where you don’t move, like a wall-sit) are also less applicable to real use cases as you age, such as the strength and agility to catch yourself from a fall. For that reason, I prefer compound exercises with some rapid, explosive movements that help to build both strength and agility.

My initial routine

Here’s my current schedule (3 sets of repetitions for each movement, with a 3-minute break between sets):

  • Monday: arm push — push-ups (as HIIT) and tricep dips. “As HIIT” means that I’ll do as many push-ups as I can fit within my HIIT pattern, then flip to my active work (e.g. jumping jacks or burpees).
  • Tuesday: arm pull — pull-ups (with L-sit, as below) and inverted rows (“Australian pull-ups”)
  • Wednesday: core — L-sits, planks (3x — 10 sec on each of front, right, left)
  • Thursday: handstands — working toward handstand push-ups as the “vertical push”
  • Friday: legs — squats (as HIIT), and Nordic curls (hamstrings & lower back)
  • Saturday/Sunday: rest — just walking. Ideally hitting 10k steps/day but no pressure to do so, if I’m starting to feel sore.

For ones that I couldn’t do initially (e.g. pull-ups, handstands, L-sits, Nordic curls), I used progressions to work my way there step by step. For pull-ups, that meant doing negatives / eccentrics by jumping up and slowly lowering myself down over multiple seconds, then repeating. For handstands, I face the wall to encourage better posture, so it’s been about longer holds and figuring out how to bail out so I can more confidently get vertical. For L-sits, I follow this progression. For Nordic curls, I’m doing slow negatives as far down as I can make it, then dropping the rest of the way onto my hands and pushing back up.

On days with multiple exercises for the same muscles, I’ll typically try to split them up so they fit more easily into a workday. For example, I’ll find 10 minutes mid-morning between meetings/calls to do one movement and 10 minutes mid-afternoon for the other. This is the same time I might’ve spent making a coffee, before I started focusing on fitness.

Combined with the walks, this plan gets me moving 4 times a day — two 20-minute walks and two 10-minute workouts, for a total of 1 hour each day. The great thing about this approach is that I never feel like I need to dedicate a ton of time to exercise, because it fits naturally into the structure of my day. I’ve also got an additional 40–60 minutes of slow walking while at my desk, which again fits easily into my day.

What I’ve learned along the way

As you can see, I’m currently at 1x/wk for non-core exercises, which is a “traditional split.” That means I’m splitting up exercises, focusing on just one set of muscles each day. The problem is that the frequency of training for each muscle group is low, which I’d like to change so that I can build strength more quickly. 

I’m switching to “paired sets” (aka “alternating sets”) that alternate among different muscle groups, so I can fit more into the same amount of time. Here’s how that works: if you were taking a 3-minute rest between sets, that gives you time to fit in an unrelated set of muscles that you weren’t using in the first exercise (e.g. biceps & triceps, quads & hamstrings, chest & back). I do this as an alternating tri-set (arm pull, arm push, legs) with a 30–45 second rest between each muscle group, and a 1.5–2 minute break between each full tri-set. You might also see “supersets,” which is a similar concept but with no breaks within the tri-set. I’ve found that I tend to get too tired and sloppy if I try a superset, so I do alternating sets instead.
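To see how much time an alternating tri-set takes, and how much rest each muscle group gets between its own sets, here is a rough timeline sketch. The 40-second set duration is an assumption (the text does not say how long a set lasts), and the function and variable names are illustrative.

```python
GROUPS = ["arm pull", "arm push", "legs"]


def triset_timeline(rounds, set_s, rest_within_s, rest_between_s):
    """Return ((start_second, group) events, total seconds) for alternating tri-sets."""
    events, t = [], 0
    for r in range(rounds):
        for i, group in enumerate(GROUPS):
            events.append((t, group))
            t += set_s
            if i < len(GROUPS) - 1:
                t += rest_within_s  # short rest before the next muscle group
        if r < rounds - 1:
            t += rest_between_s  # longer break before the next full tri-set
    return events, t


SET_S = 40  # assumed duration of one set, not taken from the text
events, total_s = triset_timeline(rounds=2, set_s=SET_S, rest_within_s=45, rest_between_s=120)

pull_starts = [t for t, group in events if group == "arm pull"]
rest_same_group = pull_starts[1] - pull_starts[0] - SET_S

print(f"total: {total_s / 60:.1f} min, rest for the same muscle group: {rest_same_group} s")
```

With these assumed numbers, two rounds fit in about 9 minutes, and each muscle group still rests for well over 3 minutes between its own sets.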

In addition, I’ve done a lot more research on strength training after getting started. For LISS and HIIT, I had a strongly research-driven approach before beginning. For strength training, I went with some more direct recommendations and only did additional academic research later. Here’s what I’ve learned since then:

  • Higher-load (80%+), multi-set workouts 2x/week are optimal for maximizing both strength and hypertrophy, according to a 2023 meta-analysis.
  • One ideal size of a set to maximize benefits seems to be 6–8 repetitions, with a 3-minute break between sets to maximize energy restoration. 6–8 reps seems like a sweet spot between strength and hypertrophy (muscle size). For endurance, 15+ repetitions should be the goal. If you want to build all of those characteristics, you should probably alternate rep counts with different loads.
  • Time efficient workout design: Use compound exercises and include both concentric & eccentric movements. Perform a minimum of one leg-pressing exercise (e.g. squats), one upper-body pulling exercise (e.g. pull-up) and one upper-body pushing exercise (e.g. push-up). Perform a minimum of 4 weekly sets per muscle group using a 6–15 rep max loading range.
  • Eccentric / negatives are superior to concentric. Don’t neglect or rush through the negatives / eccentrics. That’s the part of an exercise you ignore by default — letting your weight come down during a squat, pull-up, or push-up rather than when you’re pushing/pulling it back up. Take your time on that part, because it’s actually more important.
  • Doing something as quick as 3-second negatives, 4x/wk, will improve strength.

Overall, that suggests a workout design that looks like this (2 days a week):

  • 2+ sets of each: Compound exercises for arm push, arm pull, leg press
  • Aim for whatever difficulty is required to max out at 6–8 repetitions for strength & hypertrophy (muscle size), or up to 15 if you’re focusing on endurance
  • Do slow eccentrics / negatives on every exercise

The new routine

To incorporate this research into a redesigned routine that also includes HIIT and core work, here’s what I’ve recently changed to (most links go to “progressions” that will help you get started):

  • Monday: Strength: push-ups, pull-ups, squats as alternating set
  • Tuesday: HIIT (burpees, mountain climbers, star jumps, etc)
  • Wednesday: Core & Flexibility: L-sits, planks, Nordic curls, stretches
  • Thursday: HIIT (similar routine)
  • Friday: Strength: handstand push-ups, inverted rows, squats as alternating set
  • Saturday/Sunday: Rest days

Also, 4+ days a week, I do a quick set of a 5-second negative for each type of compound exercise (arm push, arm pull, leg press). That’s just 2 days in addition to my strength days, so I usually fit it into HIIT warm-up or cool-down.

On each day, my overall expected time commitment will be about 10 minutes. For strength training, all the alternating sets will overlap with each other. Even with a 3-min break between each set for the same muscle group, that should run quite efficiently for 2–3 sets. For HIIT, it’s already a highly compressed routine that takes ~5 minutes including warm-up and cool-down, but I need another 5 minutes afterwards to decompress after exercise that intense. You may notice that I only have one dedicated day to work my core (Wednesday), but I’m also getting core exercise during push-ups (as I plank), L-sit pull-ups, and handstands (as I balance).

The research recommendation to increase load to 80% of your max can seem more challenging with calisthenics, since it’s just about bodyweight. However, it’s always possible by decreasing your leverage, using one limb instead of two, or increasing the proportion of your weight that’s applied by changing your body angles. For example, you can do push-ups at a downwards incline with your feet on a bench/chair. You can also do more advanced types of squats like Bulgarian split squats, shrimp squats, or pistol squats.

Summary

My cardiorespiratory fitness, as measured by VO2 Max (maximal oxygen consumption) on my Apple Watch, has increased from 32 (the lowest end of “below average,” for my age & gender) to 40.1 (above average). It continues to improve on a nearly daily basis. That’s largely happened within just a couple of months, since I started walking every day and doing HIIT. 

My blood pressure (one of my initial concerns) has dropped out of pre-hypertension into the healthy range. My resting heart rate has also decreased from 63 to 56 bpm, which was a long slow process that’s occurred over the entire course of my weight loss.

On the strength side, I wasn’t expecting any gains because I’m in a caloric deficit. My main goal was to avoid losing muscle while losing weight. I’ve now been strength training for 2.5 months, and I’ve been pleasantly surprised by the “newbie gains” (which people often see in their first year or two of strength training). 

For example, I couldn’t do any pull-ups when I started. I could barely do a couple of negatives, by jumping up and letting myself down slowly. Now I can do 4 pull-ups (neutral grip). Also, I can now hold a wall handstand for 30–45 seconds and do 6–8 very small push-ups, while I could barely get into that position at all when I started. 

Overall, clear results emerged almost instantly for cardiorespiratory fitness, and as soon as 6 weeks after beginning a regular strength-training routine. If you try it out, let me know how it works for you!

[Last update: 2024-02-16]

In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to becoming much more fit, while putting in a minimum of effort. I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. A year ago, my blood pressure was already at pre-hypertension levels, despite being at a relatively young age.

I wasn’t willing to let this last any longer, and I wasn’t willing to accept that future.

Research shows that 5 factors are key to a long life — correlated with extending your life by 12–14 years:

  • Never smoking
  • BMI (body mass index) of 18.5–24.9
  • 30+ min a day of moderate/vigorous exercise
  • Moderate alcohol intake (vs none, occasional, or heavy)
  • Diet quality in the upper 40% (Alternate Healthy Eating Index)

In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending.

Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. On the bright side for these purposes, I drink moderately (almost entirely beer).

In this post, I’ll walk through my own experience going from obese to a healthy weight, with plenty of research-driven references and data along the way.

Why is this the lazy technologist’s guide, though? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.”

What’s the lowest-effort, most research-driven way to lose weight as quickly as possible without losing health? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path.

My weight-loss journey begins

My initial goal was to get down from 240 pounds (obese, BMI of 31.7) into the healthy range, reaching 185 pounds (BMI of 24.4). 
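The BMI figures can be reproduced with the standard formula for pounds and inches. The post does not state a height, so the 73 inches used below is an assumption inferred from the two weight/BMI pairs above.

```python
def bmi(weight_lb: float, height_in: float) -> float:
    """Body mass index from pounds and inches (standard 703 conversion factor)."""
    return 703 * weight_lb / height_in ** 2


HEIGHT_IN = 73  # assumption: inferred from the stated figures, not given in the post

print(round(bmi(240, HEIGHT_IN), 1))  # starting weight
print(round(bmi(185, HEIGHT_IN), 1))  # goal weight
```

With that assumed height, the two values round to 31.7 and 24.4, matching the numbers quoted above.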

My aim was to lose at the high end of a healthy rate, 2 pounds per week. Credible sources like the Mayo Clinic and the CDC suggested aiming for 1–2 pounds a week, because anything beyond that can cause issues with muscle loss as well as malnutrition.

But how could I accomplish that?

One weird trick — Eat less

I’ve lost weight once previously (about 15 years ago), although it was a smaller amount. Back then, I learned that there’s no silver bullet — the trick is to create a calorie deficit, so that your body consumes more energy than the calories in what you eat. 

Every pound is about 3500 calories, which helps to set a weekly and daily goal for your calorie deficit. For me to lose 2 pounds a week, that’s 2*3500 = 7000 calories/week, or 1000 calories/day of deficit (eating that much less than my body uses).
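The same arithmetic as a tiny helper, in case you want to plug in a different weekly goal. The 3500 calories/pound figure is the rule of thumb from the text; the names are illustrative.

```python
CALORIES_PER_POUND = 3500  # rule of thumb used in the text


def daily_deficit(pounds_per_week: float) -> float:
    """Daily calorie deficit needed to lose the given weight per week."""
    return pounds_per_week * CALORIES_PER_POUND / 7


for rate in (1, 2):
    print(f"{rate} lb/week -> {daily_deficit(rate):.0f} calories/day")
```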

Exercise barely makes a dent

It’s far more effective and efficient to create this deficit primarily through eating less rather than expecting exercise to make a huge difference. If you were previously gaining weight, you might’ve been eating 3000 calories/day or more! You can easily reduce what you eat by 1500 calories/day from that starting point, but it’s almost impossible to exercise enough to burn that many calories. An hour of intense exercise might burn 500 calories, and it’s very hard to keep up that level of effort for even one full hour — especially if you’ve been sitting in a chair all day for years on end.

Not to mention, that much exercise would defeat the whole idea of this being the lazy person’s way of making progress.

So how exactly can you reduce calories? You’ve got a lot of options, but they basically boil down to two things — eat less (portion control), and eat better (food choice).

The plan

At this point, I knew I needed to eat 1000 calories/day less than I burned. I used this calculator to identify that, as a sedentary person, I burned about 2450 calories/day. So to create that deficit, I needed to eat about 1450 calories/day. At that point, I was probably eating 2800–3000 calories/day, so that would require massive changes in my diet.
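As a quick check on those numbers, this sketch computes the intake target and how large a cut it implies from the estimated starting intake. All figures are taken from the paragraph above; the names are illustrative.

```python
def intake_target(burned_per_day: int, deficit_per_day: int) -> int:
    """Calories to eat per day to hit the desired deficit."""
    return burned_per_day - deficit_per_day


target = intake_target(burned_per_day=2450, deficit_per_day=1000)

# Estimated starting intake range from the text: 2800 to 3000 calories/day.
reduction_low = 2800 - target
reduction_high = 3000 - target

print(f"target: {target} calories/day")
print(f"cut from starting intake: {reduction_low} to {reduction_high} calories/day")
```

Hitting the target means cutting roughly half of the starting intake, which is why the change has to come mostly from diet.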

I don’t like the idea of fad diets that completely remove one or many types of foods entirely (Atkins, keto, paleo, etc), although they can work for other people. One of those big lessons about dieting is that as long as you’re removing something from what you eat, you’ll probably lose weight. 

I decided to make two big changes: how often I ate healthy vs unhealthy food, and when I ate over the course of the day. At the time, I was eating a huge amount of high-fat, high-sugar, and low-health foods like burgers and fries multiple times per week, fried food, lots of chips/crisps, white bread (very high sugar in the US) & white rice, cheese, chocolate and candy. 

I decided to shift that toward white meat (chicken/pork/turkey), seafood, salads & veggies, and whole grains (whole-wheat bread, brown rice, quinoa, etc). One pro-tip: American salad dressings are super unhealthy, often even the “vinaigrettes” that sound better. Do like Italians do, and dress salads yourself with olive oil, salt, and vinegar. However, I didn’t want to remove my favorite foods entirely, because that would destroy my long-term motivation and enjoyment of my progress. For example, once a week, I still allow myself to get a cheeseburger. But I’ll typically get a single patty, no mayo/cheese/ketchup, and with a side like salad (w/ healthy dressing) or cole slaw. I’ll also ensure my other meal of the day is very light. Many days, I’ll enjoy a small treat like 1–2 chocolates, as well (50–100 calories).

What if you like beer?

I wanted to reach my calorie target without eliminating beer, so I could both preserve my quality of life and also maintain the moderate drinking that research shows is correlated with increased lifespan. 

I was also drinking very high-calorie beer (like double IPAs and bourbon-barrel–aged imperial stouts). I shifted that toward low-alcohol, low-calorie beer (alcohol levels and calories are correlated). Bell’s Light-Hearted IPA and Lagunitas DayTime IPA are two pretty good ones in my area. Of the non-alcoholic (NA) beers, Athletic Free Wave Hazy IPA is the best I’ve found in my area, but Untappd has reasonably good ratings for Sam Adams Just the Haze and Sierra Nevada Trail Pass IPA, which should be broadly available. As a rough estimate on calories in beer, you can use this formula:

Beer calories = ABV (alcohol percentage) * 2.5 * fluid ounces

As an exception, many Belgian beers are quite “efficient” to drink, in that roughly 75% of the calories are alcohol rather than other carbs that just add calories. As a result, they violate the above formula and tend to be lower-calorie than you’d expect. This could be the result of carefully crafted recipes that consume most of the carbs, and fermentation that uses up all of the sugar. 

Here’s a more specific formula that you can use, if you’re curious about how “efficient” a given beer is, and you know how many total calories it has (find this online):

Beer calories from ethanol = (ABV * 0.8 / 100) * (29.6 * fluid ounces) * 7

(Simplified form): Beer calories from ethanol = ABV * 1.7 * fluid ounces

This uses the density and calories of ethanol (0.8 g/ml and 7 cal/g, respectively) and converts fluid ounces to milliliters (29.6 ml/oz). If you then calculate that number as a fraction of the total calories in a beer, you can find its “efficiency.” For example, a 12-ounce bottle of 8.5% beer might have 198 calories total. Using the unsimplified equation, we can calculate that it’s got 169 calories from ethanol, so 169/198 = 85% “efficient.”
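For anyone who would rather not do this by hand, here is a minimal sketch of the formulas above. The function names are invented for the example, and the constants are the ones given in the text.

```python
# Constants taken from the text above.
ETHANOL_DENSITY_G_PER_ML = 0.8
ETHANOL_CAL_PER_G = 7
ML_PER_FL_OZ = 29.6


def rough_beer_calories(abv_percent: float, fl_oz: float) -> float:
    """Rough estimate of total calories: ABV * 2.5 * fluid ounces."""
    return abv_percent * 2.5 * fl_oz


def ethanol_calories(abv_percent: float, fl_oz: float) -> float:
    """Calories that come from the alcohol itself."""
    ml_of_ethanol = (abv_percent / 100) * (ML_PER_FL_OZ * fl_oz)
    return ml_of_ethanol * ETHANOL_DENSITY_G_PER_ML * ETHANOL_CAL_PER_G


def efficiency(abv_percent: float, fl_oz: float, total_calories: float) -> float:
    """Fraction of a beer's total calories that is ethanol."""
    return ethanol_calories(abv_percent, fl_oz) / total_calories


if __name__ == "__main__":
    # The 12-ounce, 8.5% ABV, 198-calorie example from the text.
    print(round(ethanol_calories(8.5, 12)))       # 169
    print(round(efficiency(8.5, 12, 198) * 100))  # 85 (percent)
```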

If you’re really trying to optimize for this, however, beer is the wrong drink. Have a low-calorie mixed drink instead, like a vodka soda, ranch water, or rum and Diet Coke.

The plan (part 2)

Therefore, instead of giving up beer entirely, I decided to skip breakfast. I’d eaten light breakfasts for years (a small bowl of cereal, or a banana and a granola bar), so this wasn’t a big deal to me. 

Later, I discovered this qualified my diet as time-restricted intermittent fasting as well, since I was only eating/drinking between ~12pm–6pm. This approach of 18 hours off / 6 hours on (18:6 fasting) may have aided in my weight loss, but studies are mixed with some suggesting no effect.

Here’s what a day might look like on 1450 calories:

  • Lunch (400 calories). A tuna-salad sandwich (made with Greek yogurt instead of mayo) on whole-wheat bread, and a side salad with olive oil & vinegar.
  • Afternoon snack (150 calories). Sliced bell peppers, no dip, and a small bowl of cottage cheese.
  • A treat (50–100 calories). A truffle or a couple of small chocolates as an afternoon treat.
  • Dinner (650 calories). Fried chicken/fish sandwich (or kids-size burger) and a small order of fries, from a fast-casual restaurant.
  • One or two low-alcohol, light, or NA beers (150–200 calories).

When I get hungry, I often drink some water instead, because my body’s easily confused about hunger vs thirst. It’s a mental game too — I remind myself that hunger means my body is burning fat, and that’s a good thing.

For a long time, I kept track of my estimated calorie consumption mentally. More recently, I decided to make my life a little easier by switching to an app. I chose MyFitnessPal because it’s got a big database including almost everything I eat.

On this plan, I had a great deal of success in losing my first 40 pounds, getting down from 240 to 200. However, it started to feel like a bit of a struggle to maintain my weight loss as I reached 200 pounds and wanted to continue losing at the same rate of 2 pounds/week.

Adaptation, plateaus and persistence

I fell behind by about two weeks on my weight-loss goal, which was massively frustrating because I’d done so well all along. I convinced myself to keep persisting because it had worked all along for months, and this was a temporary setback.

Finally I re-used the same weight-loss calculator and realized what seemed obvious in hindsight: Since I now weighed less, I also burned fewer calories per day! Those 40 pounds that were now gone didn’t use any energy anymore, but I was still eating as if I had them. I needed to change something to restore the 1000-calorie daily deficit. 

At this point, I aimed to decrease my intake to about 1200 calories per day. This quickly became frustrating because it started to affect my quality of life by forcing choices I didn’t want to make, such as choosing between a decent dinner or a beer, or forcing me to eat a salad with no protein for dinner if I had a little bit bigger lunch.

That low calorie limit also carried the risk of causing metabolic adaptation — meaning my body could burn hundreds fewer calories per day as a result of being in a “starvation mode” of sorts. That ends up being a vicious cycle that continually forces you to eat less, and it makes weight loss even more challenging.

Consequently, I began to introduce moderate exercise (walking), so I could bring my intake back up to 1400 calories on days when I burned 200 extra calories. I’ve discussed the details in a follow-up guide for fitness.

Over the course of my learning, I discovered that it’s ideal (according to actuarial tables) to sit in the middle of the healthy range rather than be at the top of it. I maintained my initial weight-loss goal to keep myself motivated on progress, but set a second goal of reaching 165 pounds — or whatever weight it takes to get a six-pack (~10% body fat).

Eat lots of protein

I also discovered that high-protein diets are better at preserving muscle, so more of the weight loss is fat. This is especially true when coupled with resistance or strength training, which also sends your body a signal that it needs to keep its muscle instead of losing it. The minimum recommended daily allowance (RDA) of protein (0.36 grams per pound of body weight, or 67 g/day for me) could be your absolute lower limit, while as much as 0.6 g/lb (111 g/day for me) could help in improving your muscle mass. 

Another study suggested multiplying the RDA by 1.25–1.5 (or more if you exercise) to maintain muscle during weight loss, which would put my recommended protein at 84–100 grams per day. The same study also said exercise helps to maintain muscle during weight loss, so it could be an either/or situation rather than needing both. Additionally, high-protein diets can help with hunger and weight loss, in part because they keep you fuller for longer. Getting 25%–30% of daily calories from protein will get you to this level, which is a whole lot of protein. Starting from your overall daily calories, you can apply this percentage and then divide your desired protein calories by 4 to get the number of grams per day:

Protein grams per day = Total daily calories * {25%, 30%} / 4

For my calorie limit, that’s about 88–105 grams per day. 
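As a quick sketch of that calculation: the 1400-calorie figure below is an assumption, inferred from the 88–105 gram range above, and the function name is made up for the example.

```python
# Protein provides about 4 calories per gram (the divisor in the formula above).
CALORIES_PER_GRAM_PROTEIN = 4


def protein_grams(total_daily_calories: float, protein_fraction: float) -> float:
    """Grams of protein per day for a given share of daily calories."""
    return total_daily_calories * protein_fraction / CALORIES_PER_GRAM_PROTEIN


if __name__ == "__main__":
    daily_calories = 1400  # assumed calorie limit, not stated explicitly here
    low = protein_grams(daily_calories, 0.25)
    high = protein_grams(daily_calories, 0.30)
    print(f"{low:.1f} to {high:.1f} grams of protein per day")
```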

I’ve found that eating near the absolute minimum recommended protein level (67 grams per day, for my weight) tends to happen fairly naturally with my originally planned diet, while getting much higher protein takes real effort. I needed to identify low-calorie, high-protein foods and incorporate them more intentionally into meals, so that I can get enough protein without compromising my daily calorie limit. 

Here’s a good list of low-calorie, high-protein foods that are pretty affordable:

  • Breakfast/Lunch: eggs or low-fat/nonfat Greek yogurt (with honey/berries), 
  • Entree: grilled/roasted chicken (or pork/turkey) or seafood (especially shrimp, canned salmon, canned tuna), and
  • Sides: cottage cheese or lentils/beans (including soups, to make it an entree).

If you’re vegetarian, you’d want to go heavier on lentils and beans, and add plenty of nuts, including hummus and peanut butter. You probably also want to bring in tempeh, and you likely already eat tofu.

I’d never tried canned salmon before, and I was impressed with how easily I could make it into a salad or an open-faced sandwich (like Danish smørrebrød). The salmon came in large pieces and retained the original texture, as you’d want. Canned tuna has been more variable in terms of texture — I’ve had some great-looking albacore from Genova and some great-tasting (but not initially good-looking) skipjack from Wild Planet.

Avoid the most common brands of canned fish though, like Chicken of the Sea, StarKist, or Bumble Bee. They are often farmed or net-caught instead of pole/line-caught, and they may be higher in parasites (for farmed fish like salmon). I also aim to buy lower-mercury types of salmon and tuna — this means I can eat each kind of fish as often as I want, instead of once a week. I buy canned Wild Planet skipjack tuna (not albacore, but yellowfin is pretty good too) and canned Deming’s sockeye salmon (not pink salmon) at my local grocery store, and I pick up large trays of refrigerated cocktail shrimp at Costco. The Genova brand also garners good reviews for canned fish and may be easier to find. All of those are pre-cooked and ready to eat, so they’re easy to use for a quick lunch. 

Go ahead and get fresh seafood if you want, but be aware that you’ll be going through a lot of it so it could get expensive. Fish only stays good for a couple of days unless frozen, so you’ll also be making a lot of trips to the store or regularly thawing/cooking frozen fish.

Summary

Over the past 8 months, I’ve managed to lose 60 pounds (and counting!) through a low-effort approach that has minimized the overall impact on my quality of life. I’ve continued to eat the foods I want — but less of them.

The biggest challenge has been persistence through the tough times. However, not cutting out any foods completely, but rather just decreasing the frequency of unhealthy foods in my life, has been a massive help with that. That meant I didn’t feel like I was breaking my whole diet whenever I had something I really wanted, as long as it fit within my calorie limit.

What’s next? A few months after beginning my weight loss, I also started working out to get into better shape, which was another one of those original 5 factors to a long life. Right now, I’m aiming to get down to about 10% body fat, which is likely to be around 165 pounds. Then I’ll flip my eating habits into muscle-building mode, which will require a slight caloric excess rather than a deficit. 

Stay tuned to see what happens!

Igalia is preparing the 2024 Linux Display Next Hackfest and we are thrilled to announce that this year’s hackfest will take place from May 14th to 16th at our HQ in A Coruña, Spain.

This unconference-style event aims to bring together the most relevant players in the Linux display community to tackle current challenges and chart the future of the display stack.

Key goals for the hackfest include:

  • Releasing the power of collaboration: We’ll work to remove bottlenecks and pave the way for smoother, more performant displays.
  • Problem-solving powerhouse: Brainstorming sessions and collaborative coding will target issues like HDR, color management, variable refresh rates, and more.
  • Building on past commitments: Let’s solidify the progress made in recent years and push the boundaries even further.

The hackfest fosters an intimate and focused environment to brainstorm, hack, and design solutions alongside fellow display experts. Participants will dive into discussions, tinker with code, and contribute to shaping the future of the Linux display stack.

More details are available on the official website.

Stay tuned! Keep an eye out for more information, mark your calendars and start prepping your hacking gear.

February 14, 2024

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers.

Already installed? Just dnf upgrade --refresh.

Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra.

Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2.

While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan.


Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:

Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.

For a taste of the challenges we overcame, let’s look at robustness.

Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.

For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”.

“Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.

All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

  • Zero (Direct3D, Vulkan with robustBufferAccess2)
  • Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess)
  • Arbitrary values, but can’t crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.

As an example, a uniform buffer load without robustness might look like:

load.i32 result, buffer, index

Robustness adds a single unsigned minimum (umin) instruction:

umin idx, index, last
load.i32 result, buffer, idx

Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency.
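To make the clamp concrete, here is a small Python model of the idea. It is an illustration only, not driver code: the buffer is a plain list, the function name is hypothetical, and indices are masked to 32 bits to mimic an unsigned register.

```python
def robust_load(buffer: list, index: int):
    """Model of a robust load: clamp the index with an unsigned minimum.

    Assumes a non-empty buffer, like the pseudo-assembly above.
    """
    last = len(buffer) - 1
    unsigned_index = index & 0xFFFFFFFF  # treat the index as unsigned 32-bit
    idx = min(unsigned_index, last)      # umin idx, index, last
    return buffer[idx]                   # load.i32 result, buffer, idx


if __name__ == "__main__":
    data = [10, 20, 30]
    print(robust_load(data, 1))   # 20: in-bounds access is unchanged
    print(robust_load(data, 99))  # 30: out-of-bounds reads the last element
    print(robust_load(data, -1))  # 30: "negative" wraps to a huge unsigned index
```

Because the comparison is unsigned, a single minimum covers both too-large and wrapped-around indices.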

There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.

We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

address = base + stride × vertex + offset

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

imad idx, stride/4, vertex, offset/4
load.i32 result, base, idx

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

load.v4i32 result, base, vertex << 2

…with the hardware calculating the address:

base + 4 × (vertex << 2) = base + 16 × vertex

What about robustness?

We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”.

Let’s handle the latter problem. We can rewrite the addressing equation as:

address = (base + offset) + stride × vertex, where (base + offset) is the attribute base

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

umin idx, vertex, last valid
load.v4i32 result, base, idx << 2

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

offset + stride × vertex + format

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

vertex ≤ floor((size - offset - format) / stride)

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.
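The driver-side math and the shader-side clamp can be modelled roughly as follows. This is a sketch under assumptions, not the actual driver: buffers are Python `bytes`, sizes are in bytes, the stride is assumed to be nonzero, attribute formats are assumed to be at most 16 bytes, and all names are hypothetical.

```python
from typing import Optional

ZERO_BUFFER = bytes(16)  # small buffer of zeroes for entirely invalid attributes


def last_valid_vertex(size: int, offset: int, fmt_size: int, stride: int) -> Optional[int]:
    """floor((size - offset - format) / stride), or None if nothing fits."""
    remaining = size - offset - fmt_size
    if remaining < 0:
        return None  # buffer too small to load even vertex 0
    return remaining // stride


def robust_fetch(buffer: bytes, offset: int, fmt_size: int, stride: int, vertex: int) -> bytes:
    """Fetch one attribute, clamping the vertex index instead of the address."""
    last = last_valid_vertex(len(buffer), offset, fmt_size, stride)
    if last is None:
        # Swap in the zero buffer; clamping to index 0 then loads zeroes.
        data, attribute_base, last = ZERO_BUFFER, 0, 0
    else:
        data, attribute_base = buffer, offset
    idx = min(vertex, last)  # the single umin in the shader
    start = attribute_base + stride * idx
    return data[start:start + fmt_size]


if __name__ == "__main__":
    vertices = bytes(range(40))  # 40-byte buffer, stride 16, attribute at offset 4
    print(last_valid_vertex(40, 4, 4, 16))              # 2
    print(list(robust_fetch(vertices, 4, 4, 16, 0)))    # [4, 5, 6, 7]
    print(list(robust_fetch(vertices, 4, 4, 16, 100)))  # [36, 37, 38, 39]
    print(list(robust_fetch(bytes(2), 0, 4, 16, 5)))    # [0, 0, 0, 0]
```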


In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.

…But it would be no fun if our hardware was reasonable.

When we run the conformance tests for image robustness, a single test fails, and it involves “mipmapping”.

For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return…

  • Zero if the X- or Y-coordinate is out-of-bounds
  • Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

  • Zero if the X- or Y-coordinate is out-of-bounds
  • Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

vec4 data = imageLoad(x, y, level);

return (level <= levels) ? data : 0;

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

image_load R, x, y, level
ulesel R[0], level, levels, R[0], 0
ulesel R[1], level, levels, R[1], 0
ulesel R[2], level, levels, R[2], 0
ulesel R[3], level, levels, R[3], 0

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

bool valid = (level <= levels);
int x_ = valid ? x : 20000;

return imageLoad(x_, y, level);

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

ulesel x_, level, levels, x, #20000
image_load R, x_, y, level

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.
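The effect of the workaround can be checked against a toy model of the hardware behaviour described above. Everything here is an assumption made for illustration: mip levels are nested Python lists, `hw_image_load` only imitates the level clamping, and `levels` is read as the index of the last valid level.

```python
OUT_OF_BOUNDS_X = 20000  # larger than the 16384-pixel maximum image width
ZERO = (0, 0, 0, 0)


def hw_image_load(mips, x, y, level):
    """Toy model: zero for out-of-bounds X/Y, but the level is clamped."""
    level = min(level, len(mips) - 1)  # the quirk: clamp instead of returning zero
    image = mips[level]
    if not (0 <= y < len(image) and 0 <= x < len(image[0])):
        return ZERO
    return image[y][x]


def robust_image_load(mips, x, y, level):
    """Workaround: force X out-of-bounds when the level is invalid."""
    last_level = len(mips) - 1
    x_ = x if level <= last_level else OUT_OF_BOUNDS_X  # one scalar select
    return hw_image_load(mips, x_, y, level)


if __name__ == "__main__":
    mips = [
        [[(1, 1, 1, 1), (2, 2, 2, 2)], [(3, 3, 3, 3), (4, 4, 4, 4)]],  # level 0: 2x2
        [[(9, 9, 9, 9)]],                                              # level 1: 1x1
    ]
    print(hw_image_load(mips, 0, 0, 5))      # (9, 9, 9, 9): clamped, nonzero
    print(robust_image_load(mips, 0, 0, 5))  # (0, 0, 0, 0): spec-compliant
    print(robust_image_load(mips, 1, 0, 0))  # (2, 2, 2, 2): valid loads unchanged
```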


Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

Vulkanised sign at Google’s office

Last week I had an exciting opportunity to attend the Vulkanised 2024 conference. For those of you not familiar with the event, it is “The Premier Vulkan Developer Conference” hosted by the Vulkan working group from Khronos. With the excitement out of the way, I decided to write about some of the interesting information that came out of the conference.

A Few Presentations

My colleagues Iago, Stéphane, and Hyunjun each had the opportunity to present some of their work in the wider Vulkan ecosystem.

Stéphane and Hyunjun presenting

Stéphane & Hyunjun presented “Implementing a Vulkan Video Encoder From Mesa to GStreamer”. They jointly talked about the work they performed to implement the Vulkan video extensions in Intel’s ANV Mesa driver as well as in GStreamer. This was an interesting presentation because you got to see how the new Vulkan video extensions affected both driver developers implementing the extensions and application developers making use of the extensions for real-time video decoding and encoding. Their presentation is available on vulkan.org.

Iago presenting

Later my colleague Iago presented jointly with Faith Ekstrand (a well-known Linux graphics stack contributor from Collabora) on “8 Years of Open Drivers, including the State of Vulkan in Mesa”. They both talked about the current state of Vulkan in the open source driver ecosystem, and some of the benefits open source drivers have been able to take advantage of, like the common Vulkan runtime code and a shared compiler stack. You can check out their presentation for all the details.

Besides Igalia’s presentations, there were several more which I found interesting, with topics such as Vulkan developer tools, experiences of using Vulkan in real-world applications, and even how to teach Vulkan to new developers. Here are highlights from a few of them.

Using Vulkan Synchronization Validation Effectively

John Zulauf gave a presentation on the Vulkan synchronization validation layers that he has been working on. If you are not familiar with these, then you should really check them out. They work by tracking how resources are used inside Vulkan and providing error messages with some hints if you use a resource in a way where it is not synchronized properly. They can’t catch every error, but they’re a great tool in the toolbelt of Vulkan developers to make their lives easier when it comes to debugging synchronization issues. As John said in the presentation, synchronization in Vulkan is hard, and nearly every application he tested the layers on revealed a synchronization issue, no matter how simple it was. He can proudly say he is a vkQuake contributor now because of these layers.

6 Years of Teaching Vulkan with Example for Video Extensions

This was an interesting presentation from a professor at the university of Vienna about his experience teaching graphics as well as game development to students who may have little real programming experience. He covered the techniques he uses to make learning easier as well as resources that he uses. This would be a great presentation to check out if you’re trying to teach Vulkan to others.

Vulkan Synchronization Made Easy

Another presentation focused on Vulkan sync, but instead of debugging it, Grigory showed how his graphics library abstracts sync away from the user without implementing a render graph. He presented an interesting technique that is similar to how the sync validation layers work when it comes to ensuring that resources are always synchronized before use. If you’re building your own engine in Vulkan, this is definitely something worth checking out.

Vulkan Video Encode API: A Deep Dive

Tony at Nvidia did a deep dive into the new Vulkan Video extensions, explaining a bit about how video codecs work, and also including a roadmap for future codec support in the video extensions. Especially interesting for us was that he made a nice call-out to Igalia and our work on Vulkan Video CTS and open source driver support on slide (6) :)

Thoughts on Vulkanised

Vulkanised is an interesting conference that gives you the intersection of people working on Vulkan drivers, game developers using Vulkan for their graphics backend, visual FX tool developers using Vulkan-based tools in their pipeline, industrial application developers using Vulkan for some embedded commercial systems, and general hobbyists who are just interested in Vulkan. As an example of some of these interesting audience members, I got to talk with a member of the Blender Foundation about his work on the Vulkan backend to Blender.

Lastly, the event was held at Google’s offices in Sunnyvale, which I’m always happy to travel to, not just for the better weather (coming from Canada), but also for the amazing restaurants and food in the Bay Area!

Great Bay Area food
February 09, 2024

3D Printing Slicers

I recently replaced the Flashforge Adventurer 3, my first printer, which I had been using for a few years, with a BambuLab X1 Carbon, as I wanted a printer that was not a “project” so I could focus on modelling and printing. It's an investment, but my partner convinced me that I was using the printer often enough to warrant it, and told me to look out for Black Friday sales, which I did.

The hardware-specific slicer, Bambu Studio, was available for Linux, but only as an AppImage, with many people reporting crashes on startup, non-working video live view, and other problems that the hardware maker tried to work around by shipping separate AppImage variants for Ubuntu and Fedora.

After close to 150 patches to the upstream software (which, in hindsight, I could probably have avoided by compiling the C++ code with LLVM), I managed to “flatpak” the application and make it available on Flathub. It's reached 3k installs in about a month, which is quite a bit for a niche piece of software.

Note that if you click the “Donate” button on the Flathub page, it will take you to a page where you can feed my transformed fossil fuel addiction, that is, buy filament for repairs and for printing perfectly fitting everyday items rather than bulk importing them from the other side of the planet.


Preparing a Game Gear consoliser shell

I will continue to maintain the FlashPrint slicer for FlashForge printers, installed by nearly 15k users, although I have now enabled automated updates and will no longer update the release notes, which required manual intervention.

FlashForge have unfortunately never answered my queries about making this distribution of their software official (and fixing the crash when using a VPN...).

 Rhythmbox

As I was updating the Rhythmbox Flatpak on Flathub, I realised that it just reached 250k installs, which puts the number of installations of those 3D printing slicers above into perspective.

The updated screenshot used on Flathub

Congratulations, and many thanks, to all the developers that keep on contributing to this very mature project, especially Jonathan Matthew who's been maintaining the app since 2008.

February 08, 2024

After the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. Collabora is graciously looking at it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers.

It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are at performing arithmetic operations, and how slow memory buses and chips on embedded systems are (relative to high-end GPUs, at least), this starvation and the consequent underutilization are very likely to happen.

IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research in this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers. Sparsity is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more.

By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization).

By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms.
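To make the idea concrete, here is a toy run-length scheme in Python. It is not the format the hardware uses, which is not described in this post; it only assumes, for illustration, that each stored weight can carry a 5-bit count of the zeroes that precede it, so a single entry can absorb up to 31 zeroes. The names `compress_zero_runs` and `decompress_zero_runs` are made up for this sketch.

```python
MAX_RUN = (1 << 5) - 1  # largest count that fits in 5 bits: 31


def compress_zero_runs(weights):
    """Encode weights as (zero_run, value) pairs.

    Each pair means "zero_run zeroes, followed by value". Runs longer
    than MAX_RUN are split by emitting a pair whose value is itself 0.
    """
    pairs = []
    run = 0
    for w in weights:
        if w == 0 and run < MAX_RUN:
            run += 1  # keep extending the current run of zeroes
        else:
            pairs.append((run, w))  # w is 0 only when the run was full
            run = 0
    if run:
        # Trailing zeroes: store the last zero as the explicit value.
        pairs.append((run - 1, 0))
    return pairs


def decompress_zero_runs(pairs):
    """Expand (zero_run, value) pairs back into the flat weight list."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    return out


if __name__ == "__main__":
    weights = [3, 0, 0, 0, 0, 7] + [0] * 40 + [1]
    pairs = compress_zero_runs(weights)
    assert decompress_zero_runs(pairs) == weights
    print(len(weights), "weights stored as", len(pairs), "entries")
```

The sparser the buffer, the fewer entries have to cross the memory bus, which is the effect the measurements above are getting at.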



As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users' needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind everyone that we have an IRC channel on the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account at matrix.org.

What next

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's?

I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog.

I'm currently available for contracting, so I should be able to get on your project full-time on short notice.

February 07, 2024

HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs.

If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.)

I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, in part because of the kind of bootstrapping problems you usually get when building toolchains: running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, because compiling the host libraries doesn't require a HIP-enabled compiler; any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install

In some cases, additional variables need to be set.
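Since the same invocation recurs for every component, it can be handy to script it. The Python helper below is only a sketch: the function name `cmake_commands` and its parameters are hypothetical, and only the flags themselves come from the invocation above. It assembles the two commands and appends any extra `-D` variables a component needs.

```python
import os
import shlex


def cmake_commands(prefix, extra_defines=None, source_dir=".", build_dir="build"):
    """Return the configure and install commands for one component.

    `extra_defines` maps CMake variable names to values, for the
    components that need additional variables.
    """
    configure = [
        "cmake", "-S", source_dir, "-B", build_dir, "-GNinja",
        "-DCMAKE_BUILD_TYPE=RelWithDebInfo",
        f"-DCMAKE_INSTALL_PREFIX={prefix}",
        f"-DCMAKE_PREFIX_PATH={prefix}",
    ]
    for name, value in (extra_defines or {}).items():
        configure.append(f"-D{name}={value}")
    install = ["ninja", "-C", build_dir, "install"]
    return configure, install


if __name__ == "__main__":
    prefix = os.path.join(os.path.expanduser("~"), "prefix")
    # Example: a component that needs its tests switched off.
    for cmd in cmake_commands(prefix, {"BUILD_TESTING": "OFF"}):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch harmless; passing each list to `subprocess.run(cmd, check=True)` from inside a component's checkout would execute them.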

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled: -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here.

Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and that does go through libhsa-runtime64.so), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library like libamdhip64.so that it links against) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address. The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU.
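The dispatch path described above can be mimicked with a small, purely illustrative Python model. Nothing here talks to real hardware or reflects the actual AQL packet layout: `ToyQueue`, `ToyDevice` and the dictionary "packets" are inventions for this sketch. It only shows the division of labour: the application fills a slot in a ring buffer, then publishes its new position through the door bell, and that is what prompts the device to consume packets.

```python
class ToyDevice:
    """Stands in for the GPU: consumes packets when the door bell rings."""

    def __init__(self):
        self.read_index = 0
        self.executed = []

    def on_doorbell(self, queue, write_index):
        # Process every packet published since the last door bell.
        while self.read_index < write_index:
            slot = self.read_index % len(queue.ring)
            self.executed.append(queue.ring[slot]["kernel"])
            self.read_index += 1


class ToyQueue:
    """Stands in for a queue mapped into the application's memory."""

    def __init__(self, device, size=4):
        self.ring = [None] * size
        self.write_index = 0
        self.device = device

    def dispatch(self, kernel_name):
        if self.write_index - self.device.read_index >= len(self.ring):
            raise RuntimeError("ring buffer full")
        # 1. Write the packet into the next free slot of the ring buffer.
        self.ring[self.write_index % len(self.ring)] = {"kernel": kernel_name}
        self.write_index += 1
        # 2. Ring the door bell by publishing the new index to the device.
        self.device.on_doorbell(self, self.write_index)


device = ToyDevice()
queue = ToyQueue(device)
for name in ["init", "compute", "reduce", "copy", "finish"]:
    queue.dispatch(name)
print(device.executed)
```

In this model the "device" runs synchronously inside `dispatch`, so the ring never fills up; on real hardware the consumer runs asynchronously, which is why the queue needs both a ring buffer and a separate notification mechanism.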

Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific.

In any case, the required components can each be built individually against the upstream LLVM from step 1:

  • hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
  • device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
  • comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in libhsa-runtime64.so.

That last one is annoying. It needs -DBUILD_TESTING=OFF.

Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step.

It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support.

You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/clr isn't? Not a great justification in my opinion, but at least this is documented in the README.

CLR also needs a bunch of additional CMake options: -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out

Neat, huh?

February 06, 2024

I attended FOSDEM last weekend and had the pleasure to participate in the Flathub / Flatpak BOF on Saturday. A lot of the session was used up by an extensive discussion about the merits (or not) of allowing direct uploads versus building everything centrally on Flathub’s infrastructure, and related concerns such as automated security/dependency scanning.

My original motivation behind the idea was essentially two things. The first was to offer a simpler way forward for applications that use language-specific build tools that resolve and retrieve their own dependencies from the internet. Flathub doesn’t allow network access during builds, and so a lot of manual work and additional tooling is currently needed (see Python and Electron Flatpak guides). And the second was to offer a maybe more familiar flow to developers from other platforms who would just build something and then run another command to upload it to the store, without having to learn the syntax of a new build tool. There were many valid concerns raised in the room, and I think on reflection that this is still worth doing, but might not be as valuable a way forward for Flathub as I had initially hoped.

Of course, for a proprietary application where Flathub never sees the source or where it’s built, whether that binary is uploaded to us or downloaded by us doesn’t change much. But for a FLOSS application, a direct upload driven by the developer causes a regression on a number of fronts. We’re not getting too hung up on the “malicious developer inserts evil code in the binary” case because Flathub already works on the model of verifying the developer and the user makes a decision to trust that app – we don’t review the source after all. But we do lose other things such as our infrastructure building on multiple architectures, and visibility on whether the build environment or upload credentials have been compromised unbeknownst to the developer.

There is now a manual review process for when apps change their metadata such as name, icon, license and permissions – which would apply to any direct uploads as well. It was suggested that if only heavily sandboxed apps (e.g. no direct filesystem access without proper use of portals) were permitted to make direct uploads, the impact of such concerns might be somewhat mitigated by the sandboxing.

However, it was also pointed out that my go-to example of “Electron app developers can upload to Flathub with one command” was also a bit of a fiction. At present, none of them would pass that stricter sandboxing requirement. Almost all Electron apps run old versions of Chromium with less complete portal support, needing sandbox escapes to function correctly, and Electron’s (and Chromium’s) sandboxing still needs additional tooling/downstream patching to run inside a Flatpak. Buh-boh.

I think for established projects who already ship their own binaries from their own centralised/trusted infrastructure, and for developers who have understandable sensitivities about binary integrity, such as for encryption, password or financial tools, it’s a definite improvement that we’re able to set up direct uploads with such projects with less manual work. There are already quite a few applications – including verified ones – where the build recipe simply fetches a binary built elsewhere and unpacks it, and if this is already done centrally by the developer, repeating the exercise on Flathub’s server adds little value.

However, for the individual developer experience, I think we need to zoom out a bit and think about how to improve this from a tools and infrastructure perspective as we grow Flathub, and as we seek to raise funds from different sources for these improvements. I took notes for everything that was mentioned as a tooling limitation during the BOF, along with a few ideas about how we could improve things, and hope to share these soon as part of an RFP/RFI (Request For Proposals/Request for Information) process. We don’t have funding yet but if we have some prospective collaborators to help refine the scope and estimate the cost/effort, we can use this to go and pursue funding opportunities.

February 05, 2024

Vulkan Video AV1 decode has been released. I previously had some partly working support in the Intel ANV driver, but I let it lapse.

The branch is currently at [1]. It builds, but is totally untested. I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames.

Update: the current branch decodes one frame properly, reference frames need more work unfortunately.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/anv-vulkan-video-decode-av1

February 02, 2024

The Khronos Group announced VK_KHR_video_decode_av1 [1]; this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out.

This extension is one of the ones I've been wanting for a long time, since a royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a MESA extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors.

I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. This is a great feature of having open source drivers: it allows a much quicker turnaround time for bug fixes when devs can fix them themselves!

[1] https://www.khronos.org/blog/khronos-releases-vulkan-video-av1-decode-extension-vulkan-sdk-now-supports-h.264-h.265-encode

[2] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-decode-av1

[3] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27424

January 31, 2024

 A few months have passed since New Responsibilities was posted, so I thought I would provide an update.

Projects Maintenance

Of all the freedesktop projects I created and maintained, only one doesn't have a new maintainer, low-memory-monitor.

This daemon is what the GMemoryMonitor GLib API is based on, so it can't be replaced trivially. Efforts seem to be under way to replace it with systemd APIs.

As for the other daemons:

(As an aside, there's posturing towards replacing power-profiles-daemon with tuned in Fedora. I would advise stakeholders to figure out whether having a large Python script in the boot hot path is a good idea, taking a look at bootcharts, and then thinking about whether hardware manufacturers would be able to help with supporting a tool with so many moving parts. Useful for tinkering, not for shipping in a product)

Updated responsibilities

Since mid-August, I've joined the Platform Enablement Team. Right now, I'm helping out with maintenance of the Bluetooth kernel stack in RHEL (and thus CentOS).

The goal is to eventually pivot to hardware enablement, which is likely to involve backporting and testing, more so than upstream enablement. This is currently dependent on attending some formal kernel development (and debugging) training sessions which should make it easier to see where my hodge-podge kernel knowledge stands.

Blog backlog

Before being moved to a different project, and apart from the usual and very time-consuming bug triage, user support and project maintenance, I also worked on a few new features. I have a few posts planned that will lay that out.

January 29, 2024

This is a follow-up from our Spam-label approach, but this time with MOAR EMOJIS because that's what the world is turning into.

Since March 2023, projects have been able to apply the "Spam" label on any new issue and have a magic bot come in and purge the user account plus all issues they've filed; see the earlier post for details. This works quite well and gives every project member the ability to quickly purge spam. Alas, pesky spammers are using other approaches to trick Google into indexing their pork [1] (because at this point I think all this crap is just SEO spam anyway), such as commenting on issues and merge requests. We can't apply labels to comments, so we found a way to work around that: emojis!

In GitLab you can add "reactions" to issue/merge request/snippet comments and in recent GitLab versions you can register for a webhook to be notified when that happens. So what we've added to the gitlab.freedesktop.org instance is support for the :do_not_litter: (🚯) emoji [2] - if you set that on a comment, the author of said comment will be blocked and the comment content will be removed. After some safety checks of course, so you can't just go around blocking everyone by shotgunning emojis into GitLab. Unlike the "Spam" label, this does not currently work recursively, so it's best to report the user so admins can purge them properly - ideally before setting the emoji so the abuse report contains the actual spam comment instead of the redacted one. Also note that there is a 30 second grace period to quickly undo the emoji if you happen to set it accidentally.

Note that for purging issues, the "Spam" label is still required, the emojis only work for comments.
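For the curious, the decision the bot has to make can be sketched in a few lines of Python. This is not the actual bot code, and the event dictionary is a simplified, hypothetical shape rather than GitLab's real webhook payload; the `reporter_is_trusted` flag is likewise a stand-in for the safety checks, which aren't spelled out here. Only the emoji name, the comments-only rule and the 30 second grace period come from the description above.

```python
from datetime import datetime, timedelta, timezone

SPAM_EMOJI = "do_not_litter"
GRACE_PERIOD = timedelta(seconds=30)


def should_block(event, now, reporter_is_trusted):
    """Decide whether a reaction should lead to blocking the comment author.

    `event` is a simplified, hypothetical dict:
      {"emoji": str, "target": "comment" or "issue",
       "awarded_at": datetime, "revoked": bool}
    """
    if event["emoji"] != SPAM_EMOJI:
        return False  # some other reaction, nothing to do
    if event["target"] != "comment":
        return False  # issues still need the "Spam" label
    if event["revoked"]:
        return False  # the emoji was removed again
    if not reporter_is_trusted:
        return False  # stand-in for the unspecified safety checks
    # Only act once the grace period for undoing the emoji has passed.
    return now - event["awarded_at"] >= GRACE_PERIOD


if __name__ == "__main__":
    set_at = datetime(2024, 1, 29, 12, 0, tzinfo=timezone.utc)
    event = {"emoji": SPAM_EMOJI, "target": "comment",
             "awarded_at": set_at, "revoked": False}
    for seconds in (10, 45):
        now = set_at + timedelta(seconds=seconds)
        print(seconds, should_block(event, now, reporter_is_trusted=True))
```

Keeping the decision in a pure function like this makes the grace period easy to test without a running GitLab instance.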

Happy cleanup!

[1] or pork-ish
[2] Benjamin wanted to use :poop: but there's a chance that may get used for expressing disagreement with the comment in question

January 26, 2024
Many recent Intel laptops have replaced the standard UVC USB camera module with a raw MIPI camera-sensor connected to the IPU6 found in recent Intel laptop chips.

Both the hw interface of the ISP part of the IPU6 as well as the image processing algorithms used are considered a trade secret and so far the only Linux support for the IPU6 relies on an out of tree kernel driver with a proprietary userspace stack on top, which is currently available in rpmfusion.

Both Linaro and Red Hat have identified the missing ISP support for various ARM and X86 chips as a problem. Linaro has started a project to add a SoftwareISP component to libcamera to allow these cameras to work without needing proprietary software and Red Hat has joined Linaro in working on this.

FOSDEM talk

Bryan O'Donoghue (Linaro) and I are giving a talk about this at FOSDEM.

Fedora COPR repository

This work is at a point now where it is ready for wider testing. A Fedora COPR repository with a patched kernel and libcamera is now available for users to test, see the COPR page for install and test instructions.

This has been tested on the following devices:
  • Lenovo ThinkPad X1 yoga gen 8 (should work on any ThinkPad with ov2740 sensor)
  • Dell Latitude 9420 (ov01a1s sensor)
  • HP Spectre x360 13.5 (2023 model, hi556 sensor)

Description of the stack
  1. Kernel driver for the camera sensor. For the ov2740 used on current Lenovo designs (excluding MTL), I have landed all necessary kernel changes upstream.
  2. Kernel support for the CSI receiver part of the IPU6. Intel is working on upstreaming this and has recently posted v3 of their patch series, which is under active review.
  3. A FOSS Software ISP stack inside libcamera to replace the missing IPU6 ISP (processing-system/psys) support. Work on this is under way. I've recently sent out v2 of the patch series for this.
  4. Firefox pipewire camera support and support for the camera portal to get permission to access the camera. My colleague Jan Grulich has been working on this, see Jan's blogpost. Jan's work has landed in the just released Firefox 122.

January 24, 2024

Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU.

For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license.


This will have the following advantages for the project:

  1. The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto, when they update to the next stable version: 24.1.0, which should be out around May 2024. See the release calendar.
  2. Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
  3. The project has great technical infrastructure, maintained by awesome sysadmins:
  4. More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:
    • The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
    • The Gallium internal API that decouples HW-independent frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose it to other frameworks such as Android NNAPI.
  5. And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa. Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs.

The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU.

I dug a bit deeper in how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later).

Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU.

With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv projects and others that I wrote, and by staring for long hours at all the produced data in spreadsheets.

During the summer and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch.

By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU was able to. A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver.

Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations!

The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community.

Next steps

Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect myself to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP and I am certainly not going to be able to do it all!

I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this.

There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them.

There is quite some low hanging fruit for improving performance, so I expect myself to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others.

And at some point I should start looking at other NPU IP to add support for. The ones I'm currently leaning the most towards are RockChip's own IP, Mediatek's, Cadence's and Amlogic's.

Thanks

One doesn't just start writing an NPU driver alone, even less so without any documentation, so I need to thank the following people who have helped me greatly in this effort:

Collabora for allowing me to start playing with this while I still worked with them.

Libre Computer and specifically Da Xue for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it.

Igalia for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv.

Embedded Recipes for giving me the opportunity to present my work last autumn in Paris.

Lucas Stach from Pengutronix for answering my questions and listening to my problems when I suspected something was wrong in the Etnaviv kernel driver.

Neil Armstrong from Linaro for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs.

And a collective thanks to the DRI/Mesa community for being so awesome!

January 22, 2024

Time flies! Back in October, Igalia organized X.Org Developers Conference 2023 in A Coruña, Spain.

In case you don’t know it, X.Org Developers Conference, despite the X.Org in the name, is a conference for all developers working in the open-source graphics stack: anything related to DRM/KMS, Mesa, X11 and Wayland compositors, etc.

A Coruña's Orzán beach

This year, I participated in the organization of XDC in A Coruña, Spain (again!) by taking care of different aspects: from logistics in the venue (Palexco) to running it in person. It was a very tiring but fulfilling experience.

Sponsors

First of all, I would like to thank all the sponsors for their support, as without them, this conference wouldn’t happen:

XDC 2023 sponsors

They didn’t only give economic support to the conference: Igalia sponsored the welcome event and lunches; X.Org Foundation sponsored coffee breaks; Tourism Office of A Coruña sponsored the guided tour in the city center; and Raspberry Pi sent Raspberry Pi 5 boards to all speakers!

XDC 2023 Stats

XDC 2023 was a success in terms of attendance and talk submissions. Here are some stats:

  • 📈 160 registered attendees.
  • 👬 120 attendees picked up their badge in person.
  • 💻 25 attendees registered as virtual.
  • 📺 More than 6,000 views on live stream.
  • 📝 55 talks/workshops/demos spread over three days of conference.
  • 🧗‍♀️ There were 3 social events: welcome event, city center guided tour, and one unofficial climbing activity!

XDC 2023 welcome event

Was XDC 2023 perfect organization-wise? Of course… no! Like in any event, we had some issues here and there: one with the Wi-Fi network that was quickly detected and fixed; some issues with the meals and coffee breaks (food allergies mainly); a few seconds of audio lost from one talk in the live stream; and other minor things. Not bad for a community-run event!

Nevertheless, I would like to thank all the staff at Palexco for their quick response and their understanding.

Talk recordings & slides

XDC 2023 talk by André Almeida

Want to watch some talks again? All conference recordings were uploaded to the X.Org Foundation YouTube channel.

Slides are available to download in each talk description.

Enjoy!

XDC 2024

XDC 2024 will be in North America

We cannot yet say where XDC 2024 is going to happen, other than it will be in North America… but I can tell you that this will be announced soon. Stay tuned!

Want to organize XDC 2025 or XDC 2026?

If we continue with the current cadence, the 2025 event would again be in Europe, and the 2026 event would be in North America.

There is a list of requirements here. Nevertheless, feel free to contact me or the X.Org Board of Directors in order to get first-hand experience and knowledge about what organizing XDC entails.

XDC 2023 audience

Thanks

Thanks to all volunteers, collaborators, Palexco staff, GPUL, X.Org Foundation and many other people for their hard work. Special thanks to my Igalia colleague Chema, who did an outstanding job organizing the event together with me.

Thanks to the sponsors for their extraordinary support of this conference.

Thanks to Igalia not only for sponsoring the event, but also for all the support I got during the past year. I am glad to be part of this company, and I am always surprised by how great my colleagues are.

And last, but not least, thanks to all speakers and attendees. Without you, the conference wouldn’t exist.

See you at XDC 2024!

January 20, 2024

Hi! This month has been pretty hectic due to the SourceHut network outage. All of us on the staff team have invested a lot of time and energy to minimize the downtime as much as possible. Thankfully things have settled down now: there are still a lot of follow-up tasks to complete, but with less urgency. I’m really grateful for the community’s reaction, everybody has been very understanding and supportive. Thank you!

In other SourceHut news, I’ve been working on yojo, a bridge which provides CI for Codeberg projects via builds.sr.ht. I’ve added support for pull requests, taught yojo to handle multiple manifests, added logic to automatically refresh access tokens before they expire, and fixed a bunch of bugs.

The NPotM is sr.ht-container-compose, a docker-compose configuration for SourceHut. It provides an easy way to spin up a SourceHut development environment without having to set up each service and its dependencies individually. I hope this project can reduce friction for new SourceHut contributors. There are many services missing, patches welcome!

This month, we’ve finally merged the Sway pull request to use the wlroots scene-graph API! This is exciting because it fixes a whole class of bugs, it removes a lot of manual hand-rolled logic in Sway (e.g. rendering, damage tracking, input event routing, direct scan-out, some of the protocol support…), it provides nice performance optimizations via culling (e.g. the background image is no longer painted if a web browser is covering it), and it unlocks upcoming performance optimizations (e.g. KMS plane offloading). Many thanks to Alexander for writing the patches and maintaining them for over a year, and to Kirill for pushing it over the finish line!

On the wlroots side, my work on wlr_surface_synced has been merged, allowing us to latch surface commits until an arbitrary condition is met. This work is necessary for the upcoming explicit synchronization protocol, as well as the work-in-progress transactions protocol and avoiding compositor freezes when a client is very slow to render. We’ve released wlroots 0.17.1, with a collection of bugfixes backported by Simon Zeni. Last, we’ve dropped support for the legacy wl_drm protocol by default, and this caused a bit of breakage here and there (xserver, libva, amdvlk). We do really want to phase out wl_drm though, so we’ve decided to stick with that removal.

This month’s collection of miscellaneous project updates includes go-imap v2 alpha 8 with separate types for sequence numbers and UIDs, which was a lot of work to get right but I think was worth it. I’ve also released go-maildir v0.4.0 with a new Walk function (to iterate over messages without allocating a list) and numerous fixes. I’ve sent a GitLab cli patch to fix invalid release asset links for third-party GitLab instances, and a Meson patch to add C23 support.

See you next month!

January 11, 2024

This post is in part a response to an aspect of Nate’s post “Does Wayland really break everything?“, but also my reflection on discussing Wayland protocol additions, a unique pleasure that I have been involved with for the past months1.

Some facts

Before I start I want to make a few things clear: The Linux desktop will be moving to Wayland2 – this is a fact at this point (and has been for a while), sticking to X11 makes no sense for future projects. From reading Wayland protocols and working with it at a much lower level than I ever wanted to, it is also very clear to me that Wayland is an exceptionally well-designed core protocol, and so are the additional extension protocols (xdg-shell & Co.). The modularity of Wayland is great, it gives it incredible flexibility and will for sure turn out to be good for the long-term viability of this project (and also provides a path to correct protocol issues in future, if one is found). In other words: Wayland is an amazing foundation to build on, and a lot of its design decisions make a lot of sense!

The shift towards people seeing “Linux” more as an application developer platform, and taking PipeWire and XDG Portals into account when designing for Wayland is also an amazing development and I love to see this – this holistic approach is something I always wanted!

Furthermore, I think Wayland removes a lot of functionality that shouldn’t exist in a modern compositor – and that’s a good thing too! Some of X11’s features and design decisions had clear drawbacks that we shouldn’t replicate. I highly recommend reading Nate’s blog post, it’s very good and goes into more detail. And due to all of this, I firmly believe that any advancement in the Wayland space must come from within the project.

But!

But! Of course there was a “but” coming 😉 – I think while developing Wayland-as-an-ecosystem we are now entrenched into narrow concepts of how a desktop should work. While discussing Wayland protocol additions, a lot of concepts clash, people from different desktops with different design philosophies debate the merits of those over and over again never reaching any conclusion (just as you will never get an answer out of humans whether sushi or pizza is the clearly superior food, or whether CSD or SSD is better). Some people want to use Wayland as a vehicle to force applications to submit to their desktop’s design philosophies, others prefer the smallest and leanest protocol possible, other developers want the most elegant behavior possible. To be clear, I think those are all very valid approaches.

But this also creates problems: By switching to Wayland compositors, we are already forcing a lot of porting work onto toolkit developers and application developers. This is annoying, but just work that has to be done. It becomes frustrating though if Wayland provides toolkits with absolutely no way to reach their goal in any reasonable way. For Nate’s Photoshop analogy: Of course Linux does not break Photoshop, it is Adobe’s responsibility to port it. But what if Linux was missing a crucial syscall that Photoshop needed for proper functionality and Adobe couldn’t port it without that? In that case it becomes much less clear on who is to blame for Photoshop not being available.

A lot of Wayland protocol work is focused on the environment and design, while applications and work to port them often is considered less. I think this happens because the overlap between application developers and developers of the desktop environments is not necessarily large, and the overlap with people willing to engage with Wayland upstream is even smaller. The combination of Windows developers porting apps to Linux and having involvement with toolkits or Wayland is pretty much nonexistent. So they have less of a voice.

A quick detour through the neuroscience research lab

I have been involved with Freedesktop, GNOME and KDE for an incredibly long time now (more than a decade), but my actual job (besides consulting for Purism) is that of a PhD candidate in a neuroscience research lab (working on the morphology of biological neurons and its relation to behavior). I am mostly involved with three research groups in our institute, which is about 35 people. Most of us do all our data analysis on powerful servers which we connect to using RDP (with KDE Plasma as desktop). Since I joined, I have been pushing the envelope a bit to extend Linux usage to data acquisition and regular clients, and to have our data acquisition hardware interface well with it. Linux brings some unique advantages for use in research, besides the obvious one of having every step of your data management platform introspectable with no black boxes left, a goal I value very highly in research (but this would be its own blogpost).

In terms of operating system usage though, most systems are still Windows-based. Windows is what companies develop for, and what people use by default and are familiar with. The choice of operating system is very strongly driven by application availability, and WSL being really good makes this somewhat worse, as it removes the need for people to switch to a real Linux system entirely if there is the occasional software requiring it. Yet, we have a lot more Linux users than before, and use it in many places where it makes sense. I also developed a novel data acquisition software that runs only on Linux and uses the abilities of the platform to its fullest extent. All of this resulted in me asking existing software and hardware vendors for Linux support a lot more often. Vendor-customer relationship in science is usually pretty good, and vendors do usually want to help out. Same for open source projects, especially if you offer to do Linux porting work for them… But overall, the ease of use and availability of required applications and their usability reigns supreme. Most people are not technically knowledgeable and just want to get their research done in the best way possible, getting the best results with the least amount of friction.

KDE/Linux usage at a control station for a particle accelerator at Adlershof Technology Park, Germany, for reference (by 25years of KDE)3

Back to the point

The point of that story is this: GNOME, KDE, RHEL, Debian or Ubuntu: They all do not matter if the necessary applications are not available for them. And as soon as they are, the easiest-to-use solution wins. There are many facets of “easiest”: In many cases this is RHEL due to Red Hat support contracts being available, in many other cases it is Ubuntu due to its mindshare and ease of use. KDE Plasma is also frequently seen, as it is perceived a bit easier to onboard Windows users with it (among other benefits). Ultimately, it comes down to applications and 3rd-party support though.

Here’s a dirty secret: In many cases, porting an application to Linux is not that difficult. The thing that companies (and FLOSS projects too!) struggle with and will calculate the merits of carefully in advance is whether it is worth the support cost as well as continuous QA/testing. Their staff will have to do all of that work, and they could spend that time on other tasks after all.

So if they learn that “porting to Linux” not only means added testing and support, but also means choosing between the legacy X11 display server that allows for 1:1 porting from Windows and the “new” Wayland compositors that do not support the features they need, they will quickly consider it not worth the effort at all. I have seen this happen.

Of course many apps use a cross-platform toolkit like Qt, which greatly simplifies porting. But this just moves the issue one layer down, as now the toolkit needs to abstract Windows, macOS and Wayland. And Wayland does not contain features to do certain things or does them very differently from e.g. Windows, so toolkits have no way to actually implement the existing functionality in a way that works on all platforms. So in Qt’s documentation you will often find texts like “works everywhere except for on Wayland compositors or mobile”4.

Many missing bits or altered behavior are just papercuts, but those add up. And if users will have a worse experience, this will translate to more support work, or people not wanting to use the software on the respective platform.

What’s missing?

Window positioning

SDI applications with multiple windows are very popular in the scientific world. For data acquisition (for example with microscopes) we often have one monitor with control elements and one larger one with the recorded image. There are also other configurations where multiple signal modalities are acquired, and the experimenter aligns windows exactly in the way they want and expects the layout to be stored and to be loaded upon reopening the application. Even in the image from Adlershof Technology Park above you can see this style of UI design, at mega-scale. Being able to pop out elements as windows from a single-window application to move them around freely is another frequently used paradigm, and immensely useful with these complex apps.

It is important to note that this is not a legacy design, but in many cases an intentional choice – these kinds of apps work incredibly well on larger screens or many screens and are very flexible (you can have any window configuration you want, and switch between them using the (usually) great window management abilities of your desktop).

Of course, these apps will work terribly on tablets and small form factors, but that is not the purpose they were designed for and nobody would use them that way.

I assumed for sure these features would be implemented at some point, but when it became clear that that would not happen, I created the ext-placement protocol which had some good discussion but was ultimately rejected from the xdg namespace. I then tried another solution based on feedback, which turned out not to work for most apps, and now proposed xdg-placement (v2) in an attempt to maybe still get some protocol done that we can agree on, exploring more options before pushing the existing protocol for inclusion into the ext Wayland protocol namespace. Meanwhile though, we can not port any application that needs this feature, while at the same time we are switching desktops and distributions to Wayland by default.

Window position restoration

Similarly, a protocol to save & restore window positions was already proposed in 2018, 6 years ago now, but it has still not been agreed upon, and may not even help multiwindow apps in its current form. The absence of this protocol means that applications can not restore their former window positions, and the user has to move them to their previous place again and again.

Meanwhile, toolkits can not adopt these protocols and applications can not use them and can not be ported to Wayland without introducing papercuts.

Window icons

Similarly, individual windows can not set their own icons, and not-installed applications can not have an icon at all because there is no desktop-entry file to load the icon from and no icon in the theme for them. You would think this is a niche issue, but for applications that create many windows, providing icons for them so the user can find them is fairly important. Of course it’s not the end of the world if every window has the same icon, but it’s one of those papercuts that make the software slightly less user-friendly. Even applications with fewer windows like LibrePCB are affected, so much so that they rather run their app through Xwayland for now.

I decided to address this after I was working on data analysis of image data in a Python virtualenv, where my code and the Python libraries used created lots of windows all with the default yellow “W” icon, making it impossible to distinguish them at a glance. This is xdg-toplevel-icon now, but of course it is an uphill battle where the very premise of needing this is questioned. So applications can not use it yet.

Limited window abilities requiring specialized protocols

Firefox has a picture-in-picture feature, allowing it to pop out media from a mediaplayer as separate floating window so the user can watch the media while doing other things. On X11 this is easily realized, but on Wayland the restrictions posed on windows necessitate a different solution. The xdg-pip protocol was proposed for this specialized usecase, but it is also not merged yet. So this feature does not work as well on Wayland.

Automated GUI testing / accessibility / automation

Automation of GUI tasks is a powerful feature, so is the ability to auto-test GUIs. This is being worked on, with libei and wlheadless-run (and stuff like ydotool exists too), but we’re not fully there yet.

Wayland is frustrating for (some) application authors

As you see, there are valid applications and valid use cases that can not be ported yet to Wayland with the same feature range they enjoyed on X11, Windows or macOS. So, from an application author’s perspective, Wayland does break things quite significantly, because things that worked before can no longer work and Wayland (the whole stack) does not provide any avenue to achieve the same result.

Wayland does “break” screen sharing, global hotkeys, gaming latency (via “no tearing”) etc, however for all of these there are solutions available that application authors can port to. And most developers will gladly do that work, especially since the newer APIs are usually a lot better and more robust. But if you give application authors no path forward except “use Xwayland and be on emulation as second-class citizen forever”, it just results in very frustrated application developers.

For some application developers, switching to a Wayland compositor is like buying a canvas from the Linux shop that forces your brush to only draw triangles. But maybe for your avant-garde art, you need to draw a circle. You can approximate one with triangles, but it will never be as good as the artwork of your friends who got their canvases from the Windows or macOS art supply shop and have more freedom to create their art.

Triangles are proven to be the best shape! If you are drawing circles you are creating bad art!

Wayland, via its protocol limitations, forces a certain way to build application UX – often for the better, but also sometimes to the detriment of users and applications. The protocols are often fairly opinionated, a result of the lessons learned from X11. In any case though, it is the odd one out – Windows and macOS do not pose the same limitations (for better or worse!), and the effort to port to Wayland is orders of magnitude bigger, or sometimes in case of the multiwindow UI paradigm impossible to achieve to the same level of polish. Desktop environments of course have a design philosophy that they want to push, and want applications to integrate as much as possible (same as macOS and Windows!). However, there are many applications out there, and pushing a design via protocol limitations will likely just result in fewer apps.

The porting dilemma

I spent probably way too much time looking into how to get applications cross-platform and running on Linux, often talking to vendors (FLOSS and proprietary) as well. Wayland limitations aren’t the biggest issue by far, but they do start to come up now, especially in the scientific space with Ubuntu having switched to Wayland by default. For application authors there is often no way to address these issues. Many scientists do not even understand why their Python script that creates some GUIs suddenly behaves weirdly because Qt is now using the Wayland backend on Ubuntu instead of X11. They do not know the difference and also do not want to deal with these details – even though they may be programmers as well, the real goal is not to fiddle with the display server, but to get to a scientific result somehow.

Another issue is portability layers like Wine which need to run Windows applications as-is on Wayland. Apparently Wine’s Wayland driver has some heuristics to make window positioning work (and I am amazed by the work done on this!), but that can only go so far.

A way out?

So, how would we actually solve this? Fundamentally, this excessively long blog post boils down to just one essential question:

Do we want to force applications to submit to a UX paradigm unconditionally, potentially losing out on application ports or keeping apps on X11 eternally, or do we want to throw them some rope to get as many applications ported over to Wayland, even though we might sacrifice some protocol purity?

I think we really have to answer that to make the discussions on wayland-protocols a lot less grueling. This question can be answered at the wayland-protocols level, but even more so it must be answered by the individual desktops and compositors.

If the answer for your environment turns out to be “Yes, we want the Wayland protocol to be more opinionated and will not make any compromises for application portability”, then your desktop/compositor should just immediately NACK protocols that add something like this and you simply shouldn’t engage in the discussion, as you reject the very premise of the new protocol: That it has any merit to exist and is needed in the first place. In this case contributors to Wayland and application authors also know where you stand, and a lot of debate is skipped. Of course, if application authors want to support your environment, you are basically asking them now to rewrite their UI, which they may or may not do. But at least they know what to expect and how to target your environment.

If the answer turns out to be “We do want some portability”, the next question obviously becomes where the line should be drawn and which changes are acceptable and which aren’t. We can’t blindly copy all X11 behavior, some porting work to Wayland is simply inevitable. Some written rules for that might be nice, but probably more importantly, if you agree fundamentally that there is an issue to be fixed, please engage in the discussions for the respective MRs! We for sure do not want to repeat X11 mistakes, and I am certain that we can implement protocols which provide the required functionality in a way that is a nice compromise in allowing applications a path forward into the Wayland future, while also being as good as possible and improving upon X11. For example, the toplevel-icon proposal is already a lot better than anything X11 ever had. Relaxing ACK requirements for the ext namespace is also a good proposed administrative change, as it allows some compositors to add features they want to support to the shared repository more easily, while also not mandating them for others. In my opinion, it would allow for a lot less friction between the two different ideas of how Wayland protocol development should work. Some compositors could move forward and support more protocol extensions, while more restrictive compositors could support fewer. Applications can detect supported protocols at launch and change their behavior accordingly (ideally even abstracted by toolkits).

You may now say that a lot of apps are ported, so surely this issue can not be that bad. And yes, what Wayland provides today may be enough for 80-90% of all apps. But what I hope the detour into the research lab has done is convince you that this smaller percentage of apps matters. A lot. And that it may be worthwhile to support them.

To end on a positive note: When it came to porting concrete apps over to Wayland, the only real showstoppers so far5 were the missing window-positioning and window-position-restore features. I encountered them when porting my own software, and I got the issue as feedback from colleagues and fellow engineers. In second place was UI testing and automation support, the window-icon issue was mentioned twice, but being a cosmetic issue it likely simply hurts people less and they can ignore it easier.

What this means is that the majority of apps are already fine, and many others are very, very close! A Wayland future for everyone is within our grasp! 😄

I will also bring my two protocol MRs to their conclusion for sure, because as application developers we need clarity on what the platform (either all desktops or even just a few) supports and will or will not support in future. And the only way to get something good done is by contribution and friendly discussion.

Footnotes

  1. Apologies for the clickbait-y title – it comes with the subject 😉 ↩
  2. When I talk about “Wayland” I mean the combined set of display server protocols and accepted protocol extensions, unless otherwise clarified. ↩
  3. I would have picked a picture from our lab, but that would have needed permission first ↩
  4. Qt has awesome “platform issues” pages, like for macOS and Linux/X11 which help with porting efforts, but Qt doesn’t even list Linux/Wayland as supported platform. There is some information though, like window geometry peculiarities, which aren’t particularly helpful when porting (but still essential to know). ↩
  5. Besides issues with Nvidia hardware – CUDA for simulations and machine-learning is pretty much everywhere, so Nvidia cards are common, which causes trouble on Wayland still. It is improving though. ↩

Igalia is always working hard to improve 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi devices. One of our most recent efforts in this sense was the implementation of CPU jobs from the Vulkan driver to the V3D kernel driver.

What are CPU jobs and why do we need them?

In the V3DV driver, there are some Vulkan commands that cannot be performed by the GPU alone, so we implement those as CPU jobs on Mesa. A CPU job is a job that requires CPU intervention to be performed. For example, in the Broadcom VideoCore GPUs, we don’t have a way to calculate the timestamp. But we need the timestamp for Vulkan timestamp queries. Therefore, we need to calculate the timestamp on the CPU.
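To make the timestamp case concrete, here is a minimal userspace sketch of what "calculating the timestamp on the CPU" can look like. It assumes a monotonic clock is an acceptable time source. The helper names `cpu_timestamp_ns` and `write_timestamp_query` are hypothetical and are not taken from Mesa or the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in nanoseconds. */
static uint64_t cpu_timestamp_ns(void)
{
        struct timespec ts;
        int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

        assert(ret == 0);
        (void)ret; /* keeps the build quiet when NDEBUG is defined */

        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: store a CPU-side timestamp in slot `index` of a
 * query buffer. In a real driver the slots would live in a buffer object
 * that the application reads back later; here it is a plain array.
 */
static void write_timestamp_query(uint64_t *slots, size_t count, size_t index)
{
        assert(index < count);
        slots[index] = cpu_timestamp_ns();
}
```

Calling `write_timestamp_query(slots, 2, 0)` and then `write_timestamp_query(slots, 2, 1)` leaves two non-decreasing values in `slots`, which is the property a timestamp query relies on.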

A CPU job in userspace also implies CPU stalling. Sometimes, we need to hold part of the command submission flow in order to correctly synchronize their execution. This waiting period caused the CPU to stall, thereby preventing the continuous submission of jobs to the GPU. To mitigate this issue, we decided to move CPU job mechanisms from the V3DV driver to the V3D kernel driver.

In the V3D kernel driver, we have different kinds of jobs: RENDER jobs, BIN jobs, CSD jobs, TFU jobs, and CLEAN CACHE jobs. For each of those jobs, we have a DRM scheduler instance that helps us to synchronize the jobs.

If you want to know more about the different kinds of V3D jobs, check out this November Update: Exploring V3D blogpost, where I explain more about all the V3D IOCTLs and jobs.

Jobs of the same kind are dispatched and processed in the same order they are submitted, using a standard first-in-first-out (FIFO) queue system. We can synchronize different jobs across different queues using DRM syncobjs. More about the V3D synchronization framework and user extensions can be learned in this two-part blog post from Melissa Wen.

According to the kernel documentation, DRM syncobjs (synchronisation objects) are containers for stuff that helps sync up GPU commands. They’re super handy because you can use them in your own programs, share them with other programs, and even use them across different DRM drivers. Mostly, they’re used for making Vulkan fences and semaphores work.

By moving the CPU job from userspace to the kernel, we can make use of the DRM scheduler queues and all the advantages they bring. For this, we created a new type of job in the V3D kernel driver, a CPU job, which also means creating a new DRM scheduler instance and a CPU job queue. Now, instead of stalling the submission thread waiting for the GPU to idle, we can use DRM syncobjs to synchronize both CPU and GPU jobs in a submission, providing more efficient usage of the GPU.

How did we implement the CPU jobs in the kernel driver?

After we decided to have a CPU job implementation in the kernel space, we could think about two possible implementations for this job: creating an IOCTL for each type of CPU job or using a user extension to provide a polymorphic behavior to a single CPU job IOCTL.

We have different types of CPU jobs (indirect CSD jobs, timestamp query jobs, copy query results jobs…) and each of them has a common infrastructure of allocation and synchronization but performs different operations. Therefore, we decided to go with the option to use user extensions.

On Melissa’s blogpost, she digs deep into the implementation of generic IOCTL extensions in the V3D kernel driver. But, to put it simply, instead of expanding the data struct for each IOCTL every time we need to add a new feature, we define a user extension chain instead. As we add new optional interfaces to control the IOCTL, we define a new extension struct that can be linked to the IOCTL data only when required by the user.
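The extension-chain idea can be sketched in plain C as a linked list of structs that all start with the same small header. This is only an illustration under simplifying assumptions: every name here (`ext_header`, `ext_timestamp`, `submit_args`, `find_extension`, the `EXT_ID_*` values) is hypothetical, and ordinary in-process pointers stand in for the user-space addresses that a kernel would have to copy in and validate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extension IDs, for illustration only. */
enum ext_id {
        EXT_ID_SYNC = 1,
        EXT_ID_TIMESTAMP = 2,
};

/* Common header that starts every extension struct. */
struct ext_header {
        struct ext_header *next; /* next extension, or NULL at the end */
        uint32_t id;             /* tells the parser how to read the rest */
};

/* One concrete extension: the header first, then its own payload. */
struct ext_timestamp {
        struct ext_header base;
        uint32_t query_count;
};

/* The main argument struct stays small; optional data hangs off the chain. */
struct submit_args {
        uint32_t flags;
        struct ext_header *extensions;
};

/* Walk the chain and return the first extension with the given id. */
static struct ext_header *find_extension(const struct submit_args *args,
                                         uint32_t id)
{
        struct ext_header *ext;

        assert(args != NULL);

        for (ext = args->extensions; ext != NULL; ext = ext->next) {
                if (ext->id == id)
                        return ext;
        }

        return NULL;
}
```

Adding a new optional feature then means defining one more struct and one more ID, while `submit_args` and every caller that does not need the feature stay unchanged.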

Therefore, we created a new IOCTL, drm_v3d_submit_cpu, which is used to submit any type of CPU job. This single IOCTL can be extended by a user extension, which allows us to reuse the common infrastructure - avoiding code repetition - and yet use the user extension ID to identify the type of job and depending on the type of job, perform a certain operation.

struct drm_v3d_submit_cpu {
        /* Pointer to a u32 array of the BOs that are referenced by the job.
         *
         * For DRM_V3D_EXT_ID_CPU_INDIRECT_CSD, it must contain only one BO,
         * that contains the workgroup counts.
         *
         * For DRM_V3D_EXT_ID_TIMESTAMP_QUERY, it must contain only one BO,
         * that will contain the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY, it must contain only
         * one BO, that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, it must contain two
         * BOs. The first is the BO where the timestamp queries will be written
         * to. The second is the BO that contains the timestamp.
         *
         * For DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY, it must contain no
         * BOs.
         *
         * For DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY, it must contain one
         * BO, where the performance queries will be written.
         */
        __u64 bo_handles;

        /* Number of BO handles passed in (size is that times 4). */
        __u32 bo_handle_count;

        __u32 flags;

        /* Pointer to an array of ioctl extensions*/
        __u64 extensions;
};

Now, we can create a CPU job and submit it with a CPU job user extension.

And which extensions are available?

  1. DRM_V3D_EXT_ID_CPU_INDIRECT_CSD: this CPU job allows us to submit an indirect CSD job. An indirect CSD job is a job that, when executed in the queue, will map an indirect buffer, read the dispatch parameters, and submit a regular dispatch. This CPU job is used in Vulkan calls like vkCmdDispatchIndirect().
  2. DRM_V3D_EXT_ID_CPU_TIMESTAMP_QUERY: this CPU job calculates the query timestamp and updates the query availability by signaling a syncobj. This CPU job is used in Vulkan calls like vkCmdWriteTimestamp().
  3. DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY: this CPU job resets the timestamp queries based on the value offset of the first query. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for timestamp queries.
  4. DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY: this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for timestamp queries.
  5. DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY: this CPU job resets the performance queries by resetting the values of the perfmons. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for performance queries.
  6. DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY: similar to DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for performance queries.

The CPU job IOCTL structure is similar to any other V3D job. We allocate the job struct, parse all the extensions, init the job, look up the BOs and lock their reservations, add the proper dependencies, and push the job to the DRM scheduler entity.

When running a CPU job, we execute the following code:

static const v3d_cpu_job_fn cpu_job_function[] = {
        [V3D_CPU_JOB_TYPE_INDIRECT_CSD] = v3d_rewrite_csd_job_wg_counts_from_indirect,
        [V3D_CPU_JOB_TYPE_TIMESTAMP_QUERY] = v3d_timestamp_query,
        [V3D_CPU_JOB_TYPE_RESET_TIMESTAMP_QUERY] = v3d_reset_timestamp_queries,
        [V3D_CPU_JOB_TYPE_COPY_TIMESTAMP_QUERY] = v3d_copy_query_results,
        [V3D_CPU_JOB_TYPE_RESET_PERFORMANCE_QUERY] = v3d_reset_performance_queries,
        [V3D_CPU_JOB_TYPE_COPY_PERFORMANCE_QUERY] = v3d_copy_performance_query,
};

static struct dma_fence *
v3d_cpu_job_run(struct drm_sched_job *sched_job)
{
        struct v3d_cpu_job *job = to_cpu_job(sched_job);
        struct v3d_dev *v3d = job->base.v3d;

        v3d->cpu_job = job;

        if (job->job_type >= ARRAY_SIZE(cpu_job_function)) {
                DRM_DEBUG_DRIVER("Unknown CPU job: %d\n", job->job_type);
                return NULL;
        }

        trace_v3d_cpu_job_begin(&v3d->drm, job->job_type);

        cpu_job_function[job->job_type](job);

        trace_v3d_cpu_job_end(&v3d->drm, job->job_type);

        return NULL;
}

The interesting thing is that each CPU job type executes a completely different operation.

The complete kernel implementation has already landed in drm-misc-next and can be seen right here.

What did we change in Mesa-V3DV to use the new kernel-V3D CPU job?

After landing the kernel implementation, I needed to accommodate the new CPU job approach in the userspace.

A fundamental rule is not to cause regressions, i.e., to keep backwards userspace compatibility with old versions of the Linux kernel. This means we cannot break new versions of Mesa running on old kernels. Therefore, we needed to create two paths: one preserving the old way to perform CPU jobs and the other using the kernel to perform CPU jobs.

So, for example, the indirect CSD job used to add two different jobs to the queue: a CPU job and a CSD job. Now, if we have the CPU job capability in the kernel, we only add a CPU job and the CSD job is dispatched from within the kernel.

-   list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
+
+   /* If we have a CPU queue we submit the CPU job directly to the
+    * queue and the CSD job will be dispatched from within the kernel
+    * queue, otherwise we will have to dispatch the CSD job manually
+    * right after the CPU job by adding it to the list of jobs in the
+    * command buffer.
+    */
+   if (!cmd_buffer->device->pdevice->caps.cpu_queue)
+      list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
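
The two paths can be summarized with a small standalone sketch. The names `job_kind` and `queue_indirect_dispatch()` are hypothetical and only model the decision made in the diff above; they are not part of Mesa.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job kinds, for illustration only. */
enum job_kind { JOB_CPU_INDIRECT_CSD, JOB_CSD };

/*
 * Fill `out` with the jobs userspace has to queue for an indirect dispatch
 * and return how many there are. With a kernel CPU queue only the CPU job is
 * queued; without it, the CSD job must follow the CPU job manually.
 */
static size_t queue_indirect_dispatch(bool has_cpu_queue, enum job_kind out[2])
{
        size_t n = 0;

        out[n++] = JOB_CPU_INDIRECT_CSD;
        if (!has_cpu_queue)
                out[n++] = JOB_CSD;
        return n;
}
```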

Furthermore, now we can use syncobjs to sync the CPU jobs. For example, in the timestamp query CPU job, we used to stall the submission thread and wait for completion of all work queued before the timestamp query. Now, we can just add a barrier to the CPU job and it will be properly synchronized by the syncobjs without stalling the submission thread.

   /* The CPU job should be serialized so it only executes after all previously
    * submitted work has completed
    */
   job->serialize = V3DV_BARRIER_ALL;

We were able to test the implementation using multiple CTS tests, such as dEQP-VK.compute.pipeline.indirect_dispatch.*, dEQP-VK.pipeline.monolithic.timestamp.*, dEQP-VK.synchronization.*, dEQP-VK.query_pool.* and dEQP-VK.multiview.*.

The userspace implementation has already landed in Mesa and the full implementation can be checked in this MR.


More about the ongoing challenges in the Raspberry Pi driver stack can be found in this XDC 2023 talk presented by Iago Toral, Juan Suárez and myself. During this talk, Iago mentioned the CPU job work that we have been doing.

Also I cannot finish this post without thanking Melissa Wen and Iago Toral for all the help while developing the CPU jobs for the V3D kernel driver.

January 10, 2024

When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach.

Partial because MobileNetV1 is a quite old model by now and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance?

Since then, I have been spending some time looking at the state of the art for object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use.

SSDLite MobileDet is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency.

The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the time. There are other operations that I didn't support, but those were at the end and could be performed on the CPU without much penalty.

So after implementing additions along with a few medium-sized refactorings, I got the model running correctly:

Performance wasn't that bad at that point: at 129ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver.

I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware.

After a lot of time spent staring at a spreadsheet I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149ms, so almost 18 inferences can be performed per second.

If we look at a practical use case such as that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. With our current performance level, we could run 3-4 inferences on each frame if several objects are being tracked at the same time, or handle 3-4 cameras simultaneously if not.
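
The arithmetic behind these figures is simple enough to write down. The helper names below are made up for illustration; the only inputs are the 56.149ms latency and the 5 FPS camera rate quoted above.

```c
/* Inferences per second for a given per-inference latency in milliseconds. */
static double inferences_per_second(double latency_ms)
{
        return 1000.0 / latency_ms;
}

/* How many inferences fit into one frame period at the given camera rate. */
static double inferences_per_frame(double latency_ms, double camera_fps)
{
        return inferences_per_second(latency_ms) / camera_fps;
}
```

Calling `inferences_per_second(56.149)` gives a bit under 18, and `inferences_per_frame(56.149, 5.0)` gives roughly 3.5, which is where the "3-4 inferences on each frame" estimate comes from.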

Given the price level of the single board computers that contain the VIPNano, this is quite a good bang for your buck. And all open source and heading to mainline!

Next steps

I have started cleaning up the latest changes so they can be reviewed upstream. I also need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.

January 09, 2024

This is a guest post written by Daan De Meyer, systemd and mkosi maintainer

Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi.

mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes.

If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

What is mkosi?

mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios.

Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started.

At its core, the workflow of mkosi can be divided into 3 steps:

  1. Generate an OS tree for some distribution by installing a set of packages.
  2. Package up that OS tree in a variety of output formats.
  3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.

Images can be built for any of the following distributions:

  • Fedora Linux
  • Ubuntu
  • OpenSUSE
  • Debian
  • Arch Linux
  • CentOS Stream
  • RHEL
  • Rocky Linux
  • Alma Linux

And the following output formats are supported:

  • GPT disk images built with systemd-repart
  • Tar archives
  • CPIO archives (for building initramfs images)
  • USIs (Unified System Images which are full OS images packed in a UKI)
  • Sysext, confext and portable images
  • Directory trees

For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

To instead boot the image in systemd-nspawn, replace qemu with boot:

$ mkosi -d arch -p systemd -p udev -p linux -t disk boot

The resulting image, named image.raw, can be found in the current working directory. However, using a separate output directory is recommended, which is as simple as running mkdir mkosi.output.

To rebuild the image after it's already been built once, add -f to the command line before the verb. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu, or don't pass a verb at all.

By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partitions as you see fit:

  • The root partition can be encrypted.
  • Partition sizes can be customized.
  • Partitions can be protected with signed dm-verity.
  • You can opt out of having a root partition and only have a /usr partition instead.
  • You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
  • ...

As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly.

Configuring mkosi image builds

Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down.

For example, the command we used above can be written down in a configuration file mkosi.conf:

[Distribution]
Distribution=arch

[Output]
Format=disk

[Content]
Packages=
        systemd
        udev
        linux

Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

[Match]
Distribution=arch

[Content]
Packages=pacman

Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image.

To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

Bootable images

If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI.

If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

Using mkosi for development

The main requirement to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named mkosi.build (or mkosi.build.chroot) is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script so the build packages are available when running the build script but don't end up in the final image.

An example mkosi.build.chroot script for a project using meson could look as follows:

#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
if [ "$WITH_TESTS" = 1 ]; then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"

Now, every time the image is built, the build script will be executed and the results will be installed into the image.

The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it.

Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up.

The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache.

The second mode of caching caches the image after all packages have been installed but before running the build script. On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f.

Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible.

With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system.

Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

Building images without root privileges and loop devices

By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root.

Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in a future systemd release.

Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build.

Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.
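
To check whether a user has a large enough range, one can look at the matching line in /etc/subuid, which has the form name:start:count. The helper below is a hypothetical sketch of such a check and is not part of mkosi; it only handles a single line and compares names textually.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one "name:start:count" line and report whether it grants `user`
 * at least `needed` subordinate UIDs. Returns 1 if so, 0 otherwise
 * (including when the line cannot be parsed).
 */
static int subuid_line_ok(const char *line, const char *user,
                          unsigned long needed)
{
        char name[64];
        unsigned long start, count;

        /* Up to 63 non-colon characters, then two unsigned numbers. */
        if (sscanf(line, "%63[^:]:%lu:%lu", name, &start, &count) != 3)
                return 0;
        return strcmp(name, user) == 0 && count >= needed;
}
```

For example, `subuid_line_ok("daan:100000:65536", "daan", 65536)` accepts the line, while a range of only 1000 UIDs would be rejected.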

Supporting older distributions

mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so-called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd.

In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

Configuring variants of the same image using profiles

Profiles can be defined in the mkosi.profiles/ directory. The profile to use can be selected using the Profile= setting (or --profile= on the command line). A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles.

For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the command line.

Building system extension images

System extension images may extend the base system dynamically at runtime with an overlay containing additional files.

To build system extensions with mkosi, we need a base image on top of which we can build our extension.

To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go.

We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache

Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

[Output]
Format=directory

[Content]
CleanPackageMetadata=no
Packages=systemd
         udev

We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges.

Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

[Config]
Dependencies=base

[Output]
Format=sysext
Overlay=yes

[Content]
BaseTrees=%O/base
Packages=btrfs-progs

BaseTrees= points to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree.

We can't sign the extension image without a key. We can generate one by running mkosi genkey which will generate files that are automatically picked up when building the image.

Finally, you can build the base image and the extensions by running mkosi -f. You'll find btrfs.raw in mkosi.output which is the extension image.

Various other interesting features

  • To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself.
  • The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits.
  • ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed.
  • mkosi can boot directory trees in a virtual machine using virtiofsd. This is very useful for quickly rebuilding an image and booting it as the image does not have to be packed up as a disk image.
  • ...

There are many more features that we won't go over in detail here in this blog post. Learn more about those by reading the documentation.

Conclusion

I'll finish with a bunch of links to more information about mkosi and related tooling:

January 08, 2024

Slow Start

It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games.

Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is my old code the mesa bug tracker.

Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins.

Another Game I’ve Never Heard Of

Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain.

Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me.

This report came in hot over the break with some rad new shading techniques:

hm

It looks way cooler if you play the trace, but you get the idea.

Pinpoint Accuracy

First question: what in the Sam Hill is going on here?

Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit.

It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost.

My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present.

Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing.

So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:

if (!batch->has_work) {
   /* <----- HERE */
   if (pfence) {
      /* reuse last fence */
      fence = ctx->last_fence;
   }
   if (!deferred) {
      struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
      if (last) {
         sync_flush(ctx, last);
         if (last->is_device_lost)
            check_device_lost(ctx);
      }
   }
   if (ctx->tc && !ctx->track_renderpasses)
      tc_driver_internal_flush_notify(ctx->tc);
} else {
   fence = &batch->state->fence;
   submit_count = batch->state->usage.submit_count;
   if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
      deferred_fence = true;
   else
      flush_batch(ctx, true);
}

Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush.

OR DO YOU?

For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.

Synchronization: You Cannot Escape

The reason this was especially puzzling is the call sequence was:

  • end-of-frame flush
  • present
  • glFenceSync flush

which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on.

Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed?
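
To make the suspicion concrete, here is a toy model of that call sequence. It is not zink code: every name is made up, a "fence" is just the ID of the queue submission it waits for, and the model assumes that present counts as a submission without updating the last fence.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy context: submissions get increasing IDs, a fence records one ID. */
struct toy_ctx {
        uint32_t submit_count; /* queue submissions made so far */
        uint32_t last_fence;   /* fence of the last flush that had work */
        bool has_work;         /* are there unflushed commands? */
};

static uint32_t toy_submit(struct toy_ctx *ctx)
{
        return ++ctx->submit_count;
}

/* Flush: submit if there is work, otherwise reuse the previous fence. */
static uint32_t toy_flush(struct toy_ctx *ctx)
{
        if (ctx->has_work) {
                ctx->has_work = false;
                ctx->last_fence = toy_submit(ctx);
        }
        return ctx->last_fence;
}

/* Present is modeled as a queue submission that leaves last_fence alone. */
static void toy_present(struct toy_ctx *ctx)
{
        toy_submit(ctx);
}

/* A fence covers a submission if it was created at or after it. */
static bool toy_fence_covers(uint32_t fence, uint32_t submission)
{
        return fence >= submission;
}
```

Running end-of-frame flush, present, then a second flush with no work returns the same fence twice, and `toy_fence_covers()` reports that this fence does not cover the present submission. In this model, that is the gap that disappears when every flush becomes a queue submit.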

Why yes, that is stupid, but here at SGC, stupidity is our sustenance.

Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.

January 02, 2024

This Is It.

It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS.

—is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year.

It’s the year a decades-old plan has finally yielded its dividends.

Truth.

You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world.

I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks.

In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal.

The goal of infiltrating Big Triangle.

More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose.

Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources.

I have become an officer.

itsreal.png

I am the chair.

Revolution.

Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebears!

  • OpenGL 10.0 by 2025!

  • Compatibility Profile shall be renamed ‘SLOW MODE’

  • OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!

  • Depth values shall be uniformly scaled across all hardware and platforms!

  • XFB shall be outlawed!

  • Linux game ports shall no longer link to LLVM!

  • Coherent API error messages shall be printed!

  • Vendors which cannot ship functional Windows GL drivers shall ship Zink!

  • Native GL drivers on mobile platforms shall be outlawed!

  • gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!

  • Mesh and ray-tracing extensions from NVIDIA shall become core functionality!

  • GLX shall be deleted and forgotten!

  • All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!

Rise up and join me, your new GL/ES chair, in the glorious revolution!

DISCLAIMER

Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously.

Happy New Year. I missed you.

December 26, 2023

Holidays are here and I have time to look back at 2023. For six months I have been working for Igalia and what should I say?

I ❤️ it!

Leaving my comfort zone of a normal 9-5 job was the best decision. I am so proud to work on open source GPU drivers and I am able to spend much of my work time on etnaviv.

Driver maintenance

Before adding any new feature I thought it would be a great idea to improve the current state of etnaviv’s gallium driver. Therefore I reworked some general driver code to be more consistent and to have a more modern feel, and made it possible to drop some hand-rolled conversion helpers by switching to already existing solutions (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)).

I worked through the low-hanging fruit among the crashes seen in CI runs and fixed many of them.

Feature-wise, I also looked at some easy-to-implement extensions like GL_NV_conditional_render and GL_OES_texture_half_float_linear.

Besides the gallium driver I also worked on some NIR and isaspec features that are beneficial for etnaviv.

XDC2023

A personal highlight was to give a talk about etnaviv at XDC2023 in person.

You might wonder what happened since mid October in etnaviv land.

GLES3

I worked on some features that are needed to expose GLES3 and it turned out that an easy to maintain, extend and test compiler backend is needed. Sadly etnaviv’s current backend compiler does not check any of these boxes. It is so fragile that I only added some needed lowerings to pass some of the dEQP-GLES3.functional.shaders.texture_functions.* tests.

Some more fun work regarding some feature emulation is on the horizon and it’s blocked again by the current compiler.

Backend Compiler

etnaviv includes an isaspec powered disassembler now - a small step towards a new backend compiler. Next on the road to success is the etnaviv backend IR with an assembler.

The new backend compiler is able to run OpenCL kernels with the help of rusticl but I want to land the new backend compiler in smaller chunks that are easier to review.

Multiple Render Targets

During my XDC presentation I talked about a feature I got working on GC7000L - Multiple Render Targets (MRT). At that point it was more or less a proof-of-concept regarding the gallium drivers. There were some missing bits and registers for full support on more GPU models and therefore more reverse engineering work was needed. Also the gallium driver needed lots of work to add support for MRT.

Some weeks later I had MRT working on a wider range of Vivante GPUs that support this feature. This includes GC2000, GC3000 and GC7000 models among others. As etnaviv makes heavy use of GPU features it should work on even more models.

Looking forward to 2024

I am really confident that we will see GLES3 and OpenCL for etnaviv. As driver testing is quite important for my work I will expand my current board farm and will look into the new star in the CI world - ci-tron.

With that, have a happy holiday season and we’ll be back with more improvements in 2024!

December 22, 2023

Last year I wrote a recap of the Vulkan extensions Igalia helped ship in 2022, and in this post I’ll do the exact same for 2023.


For context and quoting the previous recap:

The ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions.

In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own.

So, without further ado, this is the list of extensions we helped ship in 2023.

VK_EXT_attachment_feedback_loop_dynamic_state

This extension builds on last year’s VK_EXT_attachment_feedback_loop_layout, which is used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. The new extension shipped this year adds support for setting attachment feedback loops dynamically on command buffers. As with all extensions that add more dynamic state, the goal here is to reduce the number of pipeline objects applications need to create, which makes using the API more flexible. It was created by our beloved super good coder and Valve contractor Mike Blumenkrantz. We reviewed the spec and are listed as contributors, and we wrote dynamic variants of the existing CTS tests.

VK_EXT_depth_bias_control

A new extension proposed by Joshua Ashton that also helps with layering D3D9 on top of Vulkan. The original problem is quite specific. In D3D9 and other APIs, applications can specify what is called a “depth bias” for geometry using an offset that is to be added directly as an exact value to the original depth of each fragment. In Vulkan, however, the depth bias is expressed as a factor of “r”, where “r” is a number that depends on the depth buffer format and, furthermore, may not have a specific fixed value. Implementations can use different values of “r” in an acceptable range. The mechanism provided by Vulkan without this extension is useful to apply small offsets and solve some problems, but it’s not useful to apply large offsets and/or emulate D3D9 by applying a fixed-value bias. The new extension solves these problems by giving apps the chance to control depth bias in a precise way. We reviewed the spec and are listed as contributors, and wrote CTS tests for this extension to help ship it.
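
To make the difference concrete, here is a small sketch of the depth bias arithmetic described above. It follows the formula as I read it in the Vulkan specification (offset = m × slope factor + r × constant factor, then an optional clamp). The value chosen for "r" on a 24-bit buffer is only an assumption, since implementations may pick their own, and modelling the extension's float representation as r = 1 is my reading of it rather than something taken from this post. The function and variable names are made up for the example.

```python
def depth_bias(max_slope, constant_factor, slope_factor, r, clamp=0.0):
    """Return the depth offset m * slope_factor + r * constant_factor, clamped."""
    offset = max_slope * slope_factor + r * constant_factor
    if clamp > 0.0:
        return min(offset, clamp)
    if clamp < 0.0:
        return max(offset, clamp)
    return offset


# Without the extension the application only controls the factor. The scale
# "r" below is an assumed value for a 24-bit fixed-point depth buffer.
R_D24_ASSUMED = 2.0 ** -24
scaled_bias = depth_bias(0.0, 1000.0, 0.0, R_D24_ASSUMED)

# Assumption: with a float bias representation "r" is 1, so the constant
# factor is applied as an exact offset, which is what D3D9 emulation wants.
exact_bias = depth_bias(0.0, 0.001, 0.0, 1.0)
```

In the first case the final offset depends on a value the application does not choose, while in the second the application gets exactly the offset it asked for.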

VK_EXT_dynamic_rendering_unused_attachments

This extension was proposed by Piers Daniell from NVIDIA to lift some restrictions in the original VK_KHR_dynamic_rendering extension, which is used in Vulkan to avoid having to create render passes and framebuffer objects. Dynamic rendering is very interesting because it makes the API much easier to use and, in many cases and especially on desktop platforms, it can be shipped without any associated performance loss. The new extension relaxes some restrictions that made pipelines more tightly coupled with render pass instances. Again, the goal here is to be able to reuse the same pipeline object with multiple render pass instances and remove some combinatorial explosions that may occur in some apps. We reviewed the spec and are listed as contributors, and wrote CTS tests for the new extension.

VK_EXT_image_sliced_view_of_3d

Shipped at the beginning of the year by Mike Blumenkrantz, the extension again helps emulate other APIs on top of Vulkan. Specifically, the extension allows creating 3D views of 3D images such that the views contain a subset of the slices in the image, using a Z offset and range, in the same way D3D12 allows. We reviewed the spec, we’re listed as contributors, and we wrote CTS tests for it.

VK_EXT_pipeline_library_group_handles

This one comes from Valve contractor Hans-Kristian Arntzen, who is mostly known for working on Proton projects like VKD3D-Proton. The extension is related to ray tracing and adds more flexibility when creating ray tracing pipelines. Ray tracing pipelines can hold thousands of different shaders and are sometimes built incrementally by combining so-called pipeline libraries that contain subsets of those shaders. However, to properly use those pipelines we need to create a structure called a shader binding table, which is full of shader group handles that have to be retrieved from pipelines. Prior to this extension, shader group handles from pipeline libraries had to be requeried once the final pipeline is linked, as they were not guaranteed to be constant throughout the whole process. With this extension, an implementation can tell apps they will not modify shader group handles in subsequent link steps, which makes it easier for apps to build shader binding tables. More importantly, this also more closely matches functionality in DXR 1.1, making it easier to emulate DirectX Raytracing on top of Vulkan raytracing. We reviewed the spec, we’re listed as contributors and we wrote CTS tests for it.

VK_EXT_shader_object

Shader objects is probably the most notorious extension shipped this year, and we contributed small bits to it. This extension makes every piece of state dynamic and removes the need to use pipelines. It’s always used in combination with dynamic rendering, which also removes render passes and framebuffers as explained above. This results in great flexibility from the application point of view. The extension was created by Daniel Story from Nintendo, and its vast set of CTS tests was created by Žiga Markuš, but we added our grain of sand by reviewing the spec and proposing some changes (which is why we’re listed as contributors), as well as fixing some shader object tests and providing some improvements here and there once they had been merged. A good part of this work was done in coordination with Mesa developers who were working on implementing this extension for different drivers.

VK_KHR_video_encode_h264 and VK_KHR_video_encode_h265

Fresh out of the oven, these Vulkan Video extensions allow leveraging the hardware to efficiently encode H.264 and H.265 streams. This year we’ve been doing a ton of work related to Vulkan Video in drivers, libraries like GStreamer and CTS/spec, including the two extensions mentioned above. Although not listed as contributors to the spec in those two Vulkan extensions, our work played a major role in advancing the state of Vulkan Video and getting them shipped.

Epilogue

That’s it for this year! I’m looking forward to helping ship more extensions next year and trying to do my part in making Vulkan drivers on Linux (and other platforms!) more stable and feature rich. My Vulkan Video colleagues at Igalia have already started work on future Vulkan Video extensions for AV1 and VP9. Hopefully some of that work is ratified next year. Fingers crossed!

December 21, 2023

"Don't cross the streams. It would be bad."

IR refactorings

A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configuration is generated.

What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked to each other non-sequentially.

The image below shows six of the more than a hundred operations in the SSDLite MobileDet model:

A small subsection of SSDLite MobileDet

The adds will be "lowered" or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run in the TP (tensor processing) cores in the NPU.
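
The idea behind lowering an addition to a convolution can be shown with a toy example. This is only a sketch of the arithmetic, assuming single-channel inputs and a 1x1 kernel whose two weights are both 1. The real lowering on the NPU also has to deal with quantization and multi-channel tensors, which are not modelled here, and the function name is made up.

```python
def add_as_convolution(a, b):
    """Add two single-channel HxW tensors via a 1x1 convolution.

    The inputs are stacked as two channels of one tensor and convolved with
    a 1x1 kernel that has one weight of 1 per input channel.
    """
    weights = (1, 1)
    height, width = len(a), len(a[0])
    # Concatenate the inputs: every pixel now carries two channel values.
    stacked = [[(a[y][x], b[y][x]) for x in range(width)] for y in range(height)]
    # A 1x1 convolution is a per-pixel weighted sum over the channels.
    return [
        [sum(w * v for w, v in zip(weights, stacked[y][x])) for x in range(width)]
        for y in range(height)
    ]


a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
result = add_as_convolution(a, b)
```

Since each output pixel is the weighted sum of the two channels, the result is the element-wise sum of the two inputs.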

As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations.

For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model.
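
A minimal sketch of that layout could look like the following. It is not the driver's actual code: the class names, fields, helper methods and tensor shapes are all hypothetical and only illustrate operations pointing at tensors through indices into one shared list.

```python
from dataclasses import dataclass, field


@dataclass
class Tensor:
    name: str
    shape: tuple


@dataclass
class Operation:
    kind: str
    inputs: list   # indices into Graph.tensors
    outputs: list  # indices into Graph.tensors


@dataclass
class Graph:
    tensors: list = field(default_factory=list)
    operations: list = field(default_factory=list)

    def add_tensor(self, name, shape):
        """Append a tensor and return its index in the shared list."""
        self.tensors.append(Tensor(name, shape))
        return len(self.tensors) - 1

    def add_operation(self, kind, inputs, outputs):
        self.operations.append(Operation(kind, list(inputs), list(outputs)))

    def consumers(self, tensor_index):
        """Return the operations that read the given tensor."""
        return [op for op in self.operations if tensor_index in op.inputs]


graph = Graph()
x = graph.add_tensor("input", (1, 320, 320, 3))
y = graph.add_tensor("conv_a_out", (1, 160, 160, 32))
z = graph.add_tensor("conv_b_out", (1, 160, 160, 32))
s = graph.add_tensor("sum", (1, 160, 160, 32))
graph.add_operation("conv", [x], [y])
graph.add_operation("conv", [y], [z])
graph.add_operation("add", [y, z], [s])
```

Because tensors live in one list, a non-sequential link such as "conv_a_out" feeding both the second convolution and the add is just the same index appearing in two operations, and `graph.consumers(y)` finds both.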

I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level. Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will be thinking of doing something similar just for it.

For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high level operations to GPGPU instructions, probably starting from a standard such as MLIR.

Tensor addition

I also put some time into collecting all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of what parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration.
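
The same kind of hint can be obtained outside a spreadsheet. The sketch below computes the Pearson correlation coefficient, which is what CORREL() returns, between each high-level parameter and one HW field. All the numbers are made up for illustration, and the HW field is deliberately constructed as 2 * width + 3 so that one parameter stands out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up observations, one entry per captured job.
input_width = [8, 16, 32, 64, 128]
input_channels = [3, 3, 16, 16, 32]
hw_field = [19, 35, 67, 131, 259]  # hypothetical value, built as 2 * width + 3

scores = {
    "input_width": pearson(input_width, hw_field),
    "input_channels": pearson(input_channels, hw_field),
}
# The parameter with the strongest correlation is the best first guess.
best = max(scores, key=lambda name: abs(scores[name]))
```

A coefficient close to 1 or -1 only suggests a linear relationship. The actual formula still has to be worked out and verified against more captures.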

Lowering the strides

Similarly to the above, there was a lot of staring at a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with a stride other than one.

Status and next steps

Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection.

The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.

The whole of SSDLite MobileDet

 

December 20, 2023

Last week marked a major milestone for me: the AMD driver-specific color management properties reached the upstream linux-next!

And to celebrate, I’m happy to share the slide notes from my 2023 XDC talk, “The Rainbow Treasure Map”, along with the individual recording that just dropped last week on YouTube – talk about happy coincidences!

Steam Deck Rainbow: Treasure Map & Magic Frogs

While I may be bubbly and chatty in everyday life, the stage isn’t exactly my comfort zone (hallway talks are more my speed). But the journey of developing the AMD color management properties was so full of discoveries that I simply had to share the experience. Witnessing the fantastic work of Jeremy and Joshua bring it all to life on the Steam Deck OLED was like uncovering magical ingredients and whipping up something truly enchanting.

For XDC 2023, we split our Rainbow journey into two talks. My focus, “The Rainbow Treasure Map,” explored the new color features we added to the Linux kernel driver, diving deep into the hardware capabilities of AMD/Steam Deck. Joshua then followed with “The Rainbow Frogs” and showed the breathtaking color magic released on Gamescope thanks to the power unlocked by the kernel driver’s Steam Deck color properties.

Packing a Rainbow into 15 Minutes

I had so much to tell, but a half-slot talk meant crafting a concise presentation. To squeeze everything into 15 minutes (and calm my pre-talk jitters a bit!), I drafted and practiced those slides and notes countless times.

So grab your map, and let’s embark on the Rainbow journey together!

Slide 1: The Rainbow Treasure Map - Advanced Color Management on Linux with AMD/SteamDeck

Intro: Hi, I’m Melissa from Igalia and welcome to the Rainbow Treasure Map, a talk about advanced color management on Linux with AMD/SteamDeck.

Slide 2: List useful links for this technical talk

Useful links: First of all, if you are not used to the topic, you may find these links useful.

  1. XDC 2022 - I’m not an AMD expert, but… - Melissa Wen
  2. XDC 2022 - Is HDR Harder? - Harry Wentland
  3. XDC 2022 Lightning - HDR Workshop Summary - Harry Wentland
  4. Color management and HDR documentation for FOSS graphics - Pekka Paalanen et al.
  5. Cinematic Color - 2012 SIGGRAPH course notes - Jeremy Selan
  6. AMD Driver-specific Properties for Color Management on Linux (Part 1) - Melissa Wen

Slide 3: Why do we need advanced color management on Linux?

Context: When we talk about colors in the graphics chain, we should keep in mind that we have a wide variety of source content colorimetry, a variety of output display devices and also the internal processing. Users expect consistent color reproduction across all these devices.

The userspace can use GPU-accelerated color management to get it. But this also requires an interface with display kernel drivers that is currently missing from the DRM/KMS framework.

Slide 4: Describe our work on AMD driver-specific color properties

Since April, I’ve been bothering the DRM community by sending patchsets from Joshua’s and my work to add driver-specific color properties to the AMD display driver. In parallel, discussions on defining a generic color management interface are still ongoing in the community. Moreover, we are still not clear about the diversity of color capabilities among hardware vendors.

To bridge this gap, we defined a color pipeline for Gamescope that fits the latest versions of AMD hardware. It delivers advanced color management features for gamut mapping, HDR rendering, SDR on HDR, and HDR on SDR.

Slide 5: Describe the AMD/SteamDeck - our hardware

AMD/Steam Deck hardware: AMD frequently releases new GPU and APU generations. Each generation comes with a DCN version with display hardware improvements. Therefore, keep in mind that this work uses the AMD Steam Deck hardware and its kernel driver. The Steam Deck is an APU with a DCN3.01 display driver, part of the DCN3 family.

It’s important to have this information since newer AMD DCN drivers inherit implementations from previous families, but each generation of AMD hardware may also introduce new color capabilities. Therefore, I recommend familiarizing yourself with the hardware you are working on.

Slide 6: Diagram with the three layers of the AMD display driver on Linux

The AMD display driver in the kernel space: It consists of three layers, (1) the DRM/KMS framework, (2) the AMD Display Manager, and (3) the AMD Display Core. We extended the color interface exposed to userspace by leveraging existing DRM resources and connecting them using driver-specific functions for color property management.

Slide 7: Three-layers diagram highlighting AMD Display Manager, DM - the layer that connects DC and DRM

Bridging DC color capabilities and the DRM API required significant changes in the color management of AMD Display Manager - the Linux-dependent part that connects the AMD DC interface to the DRM/KMS framework.

Slide 8: Three-layers diagram highlighting AMD Display Core, DC - the shared code

The AMD DC is the OS-agnostic layer. Its code is shared between platforms and DCN versions. Examining this part helps us understand the AMD color pipeline and hardware capabilities, since the machinery for hardware settings and resource management is already there.

Slide 9: Diagram of the AMD Display Core Next architecture with main elements and data flow

The newest architecture for AMD display hardware is the AMD Display Core Next.

Slide 10: Diagram of the AMD Display Core Next where only DPP and MPC blocks are highlighted

In this architecture, two blocks have the capability to manage colors:

  • Display Pipe and Plane (DPP) - for pre-blending adjustments;
  • Multiple Pipe/Plane Combined (MPC) - for post-blending color transformations.

Let’s see what we have in the DRM API for pre-blending color management.

Slide 11: Blank slide with no content only a title 'Pre-blending: DRM plane'

DRM plane color properties:

This is the DRM color management API before blending.

Nothing!

Except two basic DRM plane properties: color_encoding and color_range for the input colorspace conversion, which are not covered by this work.

Slide 12: Diagram with color capabilities and structures in AMD DC layer without any DRM plane color interface (before blending), only the DRM CRTC color interface for post blending

In case you’re not familiar with AMD shared code, what we need to do is basically draw a map and navigate there!

We have some DRM color properties after blending, but nothing before blending yet. But much of the hardware programming was already implemented in the AMD DC layer, thanks to the shared code.

Slide 13: Previous Diagram with a rectangle to highlight the empty space in the DRM plane interface that will be filled by AMD plane properties

Still both the DRM interface and its connection to the shared code were missing. That’s when the search begins!

Slide 14: Color Pipeline Diagram with the plane color interface filled by AMD plane properties but without connections to AMD DC resources

AMD driver-specific color pipeline:

Looking at the color capabilities of the hardware, we arrive at this initial set of properties. The path wasn’t exactly like that. We had many iterations and discoveries until we reached this pipeline.

Slide 15: Color Pipeline Diagram connecting AMD plane degamma properties, LUT and TF, to AMD DC resources

The Plane Degamma is our first driver-specific property before blending. It’s used to linearize the color space from encoded values to linear light values.

Slide 16: Describe plane degamma properties and hardware capabilities

We can use a pre-defined transfer function or a user lookup table (in short, LUT) to linearize the color space.

Pre-defined transfer functions for plane degamma are hardcoded curves that go to a specific hardware block called DPP Degamma ROM. It supports the following transfer functions: sRGB EOTF, BT.709 inverse OETF, PQ EOTF, and pure power curves Gamma 2.2, Gamma 2.4 and Gamma 2.6.

We also have a one-dimensional LUT. This 1D LUT has 4096 entries, the usual 1D LUT size in DRM/KMS. It’s an array of drm_color_lut that goes to the DPP Gamma Correction block.
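
To give a feeling for what these curves and the user LUT contain, here is a small sketch in Python. The sRGB and PQ formulas are the standard ones (IEC 61966-2-1 and SMPTE ST 2084). The LUT builder is only an illustration: it produces plain tuples that mirror the red, green and blue 16-bit fields of drm_color_lut, and the function names are made up, so it is not code from the driver or from Gamescope.

```python
def srgb_eotf(v):
    """sRGB EOTF: encoded value in [0, 1] to linear value in [0, 1]."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4


def pq_eotf(v):
    """PQ (SMPTE ST 2084) EOTF: encoded value in [0, 1] to cd/m^2."""
    m1 = 2610 / 16384
    m2 = 2523 / 4096 * 128
    c1 = 3424 / 4096
    c2 = 2413 / 4096 * 32
    c3 = 2392 / 4096 * 32
    p = v ** (1 / m2)
    return 10000.0 * (max(p - c1, 0.0) / (c2 - c3 * p)) ** (1 / m1)


def build_degamma_lut(eotf, size=4096):
    """Sample a normalized EOTF into (red, green, blue) 16-bit triples."""
    lut = []
    for i in range(size):
        linear = eotf(i / (size - 1))
        value = round(linear * 0xFFFF)
        lut.append((value, value, value))
    return lut


lut = build_degamma_lut(srgb_eotf)
```

A pre-defined transfer function saves userspace from computing and uploading such a table when one of the hardcoded curves is enough.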

Slide 17: Color Pipeline Diagram connecting AMD plane CTM property to AMD DC resources

We also have now a color transformation matrix (CTM) for color space conversion.

Slide 18: Describe plane CTM property and hardware capabilities

It’s a 3x4 matrix of fixed-point values that goes to the DPP Gamut Remap Block.
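
As an illustration of what "fixed-point" means here, the sketch below encodes a 3x4 matrix of floats. It assumes the same S31.32 sign-magnitude encoding that the generic DRM CTM property uses, and a row-major layout; both are assumptions that should be checked against the kernel headers. The helper names are made up.

```python
def to_s31_32(value):
    """Encode a number as S31.32 sign-magnitude.

    Bit 63 is the sign; the remaining bits hold the magnitude scaled by 2**32.
    """
    magnitude = round(abs(value) * (1 << 32))
    if magnitude >= 1 << 63:
        raise ValueError("value out of range for S31.32")
    sign = (1 << 63) if value < 0 else 0
    return sign | magnitude


def ctm_3x4_to_fixed(rows):
    """Flatten a 3x4 matrix (assumed row-major) into twelve 64-bit words."""
    assert len(rows) == 3 and all(len(row) == 4 for row in rows)
    return [to_s31_32(v) for row in rows for v in row]


# Identity transform: the fourth column holds the per-channel offsets.
identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
words = ctm_3x4_to_fixed(identity)
```

Note that sign-magnitude is not two's complement: -0.5 is encoded as the sign bit plus the encoding of 0.5.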

Both pre- and post-blending matrices previously went to the same color block. We worked on detaching them to clear both paths.

Now each CTM goes on its own way.

Slide 19: Color Pipeline Diagram connecting AMD plane HDR multiplier property to AMD DC resources

Next, the HDR Multiplier. HDR Multiplier is a factor applied to the color values of an image to increase their overall brightness.

Slide 20: Describe plane HDR mult property and hardware capabilities

This is useful for converting images from a standard dynamic range (SDR) to a high dynamic range (HDR). As it can range beyond [0.0, 1.0] subsequent transforms need to use the PQ(HDR) transfer functions.

Slide 21: Color Pipeline Diagram connecting AMD plane shaper properties, LUT and TF, to AMD DC resources

And we need a 3D LUT. But a 3D LUT has a limited number of entries in each dimension, so we want to use it in a colorspace that is optimized for human vision, that is, in a non-linear space. To deliver it, userspace may need one 1D LUT before the 3D LUT to delinearize content and another one after to linearize content again for blending.

Slide 22: Describe plane shaper properties and hardware capabilities

The pre-3D-LUT curve is called Shaper curve. Unlike Degamma TF, there are no hardcoded curves for shaper TF, but we can use the AMD color module in the driver to build the following shaper curves from pre-defined coefficients. The color module combines the TF and the user LUT values into the LUT that goes to the DPP Shaper RAM block.

Slide 23: Color Pipeline Diagram connecting AMD plane 3D LUT property to AMD DC resources

Finally, our rockstar, the 3D LUT. 3D LUT is perfect for complex color transformations and adjustments between color channels.

Slide 24: Describe plane 3D LUT property and hardware capabilities

A 3D LUT is also more complex to manage and requires more computational resources; as a consequence, its number of entries is usually limited. To overcome this restriction, the array contains samples from the approximated function and values between samples are estimated by tetrahedral interpolation. AMD supports 17 and 9 as the size of a single dimension. Blue is the outermost dimension, red the innermost.
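
The indexing and the interpolation can be sketched in a few lines of Python. This is a software illustration only, assuming a flat array of (r, g, b) float triples with blue outermost and red innermost as described above. It says nothing about how the hardware implements the lookup, and the function names are made up.

```python
def lut_index(r, g, b, size):
    """Flat index with blue as the outermost and red as the innermost dimension."""
    return (b * size + g) * size + r


def sample_tetrahedral(lut, size, rgb):
    """Look up an (r, g, b) triple in [0, 1] using tetrahedral interpolation."""
    base, frac = [], []
    for channel in rgb:
        pos = min(max(channel, 0.0), 1.0) * (size - 1)
        cell = min(int(pos), size - 2)
        base.append(cell)
        frac.append(pos - cell)

    # Walk from the cell's origin corner to the opposite corner, stepping
    # along the axis with the largest fraction first. The four corners
    # visited are the vertices of the tetrahedron that contains the point.
    order = sorted(range(3), key=lambda axis: frac[axis], reverse=True)
    corner = list(base)
    corners = [tuple(corner)]
    weights = [1.0 - frac[order[0]]]
    for n, axis in enumerate(order):
        corner[axis] += 1
        corners.append(tuple(corner))
        next_frac = frac[order[n + 1]] if n < 2 else 0.0
        weights.append(frac[axis] - next_frac)

    out = [0.0, 0.0, 0.0]
    for weight, (r, g, b) in zip(weights, corners):
        entry = lut[lut_index(r, g, b, size)]
        for c in range(3):
            out[c] += weight * entry[c]
    return tuple(out)


def identity_lut(size):
    """Build a LUT that maps every color to itself."""
    step = 1.0 / (size - 1)
    return [
        (r * step, g * step, b * step)
        for b in range(size) for g in range(size) for r in range(size)
    ]


lut17 = identity_lut(17)
sample = sample_tetrahedral(lut17, 17, (0.25, 0.5, 0.9))
```

With an identity LUT the sample comes back unchanged, which is a handy sanity check before loading a real transform. Only four of the eight cell corners are read per lookup, which is one reason tetrahedral interpolation is popular for 3D LUTs.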

Slide 25: Color Pipeline Diagram connecting AMD plane blend properties, LUT and TF, to AMD DC resources

As mentioned, we need a post-3D-LUT curve to linearize the color space before blending. This is done by Blend TF and LUT.

Slide 26: Describe plane blend properties and hardware capabilities

Similar to shaper TF, there are no hardcoded curves for Blend TF. The pre-defined curves are the same as the Degamma block, but calculated by the color module. The resulting LUT goes to the DPP Blend RAM block.

Slide 27: Color Pipeline Diagram  with all AMD plane color properties connect to AMD DC resources and links showing the conflict between plane and CRTC degamma

Now we have everything connected before blending. As a conflict between plane and CRTC Degamma was inevitable, our approach doesn’t accept that both are set at the same time.

Slide 28: Color Pipeline Diagram connecting AMD CRTC gamma TF property to AMD DC resources

We also optimized the conversion of the framebuffer to wire encoding by adding support for pre-defined CRTC Gamma TF.

Slide 29: Describe CRTC gamma TF property and hardware capabilities

Again, there are no hardcoded curves and TF and LUT are combined by the AMD color module. The same types of shaper curves are supported. The resulting LUT goes to the MPC Gamma RAM block.

Slide 30: Color Pipeline Diagram with all AMD driver-specific color properties connect to AMD DC resources

Finally, we arrived at the final version of the DRM/AMD driver-specific color management pipeline. With this knowledge, you’re ready to better enjoy the rainbow treasure of AMD display hardware and the world of graphics computing.

Slide 31: SteamDeck/Gamescope Color Pipeline Diagram with rectangles labeling each block of the pipeline with the related AMD color property

With this work, Gamescope/Steam Deck embraces the color capabilities of the AMD GPU. We highlight here how we map the Gamescope color pipeline to each AMD color block.

Slide 32: Final slide. Thank you!

Future work: The search for the rainbow treasure is not over! The Linux DRM subsystem contains many hidden treasures from different vendors. We want more complex color transformations and adjustments available on Linux. We also want to expose all GPU color capabilities from all hardware vendors to the Linux userspace.

Thanks Joshua and Harry for this joint work and the Linux DRI community for all feedback and reviews.

The amazing part of this work comes in the next talk with Joshua and The Rainbow Frogs!

Any questions?


References:

  1. Slides of the talk The Rainbow Treasure Map.
  2. Youtube video of the talk The Rainbow Treasure Map.
  3. Patch series for AMD driver-specific color management properties (upstream Linux v6.8).
  4. SteamDeck/Gamescope color management pipeline
  5. XDC 2023 website.
  6. Igalia website.
December 19, 2023

Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while.

The latest branch is at [1]. This branch has been updated for the new final headers.

Updated: It passes all of h265 CTS now, but it is failing one h264 test.

Initial ffmpeg support is [2].

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads

[2] https://github.com/cyanreg/FFmpeg/commits/vulkan/

December 17, 2023

Hi all!

This month we’ve finally released wlroots 0.17.0! It’s been a long time since the previous release (1 year), so we’ll try to ship future releases a bit more frequently. We’re preparing 0.17.1 with a collection of bugfixes; it should be ready soon.

I’ve been working on wlr_surface_synced, a new wlroots abstraction to allow surface commits coming from clients to be delayed. This is required to avoid stalling the whole compositor if a client’s GPU work is slow and to implement explicit synchronization. I’ve also been working on a commit-queue-v1 implementation for wlroots and gamescope, which will allow us to get rid of a CPU wait in Mesa. And I’ve put some finishing touches on Rose’s frame scheduler patches. Last, I’ve merged André Almeida’s kernel patches for atomic async page-flips, making it so modern compositors can enable tearing page-flips without having to go through the legacy KMS uAPI.

I’ve added OAuth refresh tokens to meta.sr.ht. Having to renew OAuth tokens every year on my clients is annoying, with refresh tokens that’s a thing of the past! I’ve already updated hottub (CI bridge for GitHub) to leverage this, and I’d like to also implement this in hut (CLI tool) and yojo (CI bridge for Codeberg). Note that since meta.sr.ht has only now started returning refresh tokens on login, users will need to re-login one last time so that the OAuth clients can grab the refresh token.

The NPotM is a bit peculiar: I haven’t actually started working on it this month, and it’s not in a usable state yet. It’s go-sqlgen, a Go code generator which takes SQL as input. The goal is to store SQL queries in a separate file, to make them safer (type checking for the arguments) and faster (prepared statements). It’s somewhat similar to sqlc except it aims at being simpler and database-agnostic. There’s still much to do: I’d like to add support for named parameters, check that the number of parameters in the query matches the number of procedure arguments, and make it easy to write migrations. I’m not yet sure go-sqlgen is worth the trouble: being database-agnostic limits its abilities, perhaps too much.

Then comes the usual mix of random smaller updates. I’ve released soju 0.7.0 and goguma 0.6.0 with a few new features and bugfixes. pyonji now understands the b4 config file, so it’s possible to add this file to your project to preconfigure pyonji with a mailing list (example). delthas has implemented account data import in hut, so it’s now easy to migrate accounts between sr.ht instances, or projects between accounts. go-scfg now supports decoding a configuration file directly into a Go struct, making it unnecessary to hand-roll parsing code (example).

I’ll be giving a FOSDEM talk about quirks and gotchas of the IMAP protocol this year. I’ll be happy to say hi if any of you are coming as well. That’s all I have for this month, see you in January!

December 14, 2023

You may have seen the news that Red Hat Enterprise Linux 10 plans to remove Xorg. But Xwayland will stay around, and given the name overloading and them sharing a git repository there's some confusion over what is Xorg. So here's a very simple "picture". This is the xserver git repository:

$ tree -d -L 2 xserver
xserver
├── composite
├── config
├── damageext
├── dbe
├── dix
├── doc
│   └── dtrace
├── dri3
├── exa
├── fb
├── glamor
├── glx
├── hw
│   ├── kdrive
│   ├── vfb
│   ├── xfree86              <- this one is Xorg
│   ├── xnest
│   ├── xquartz
│   ├── xwayland
│   └── xwin
├── include
├── m4
├── man
├── mi
├── miext
│   ├── damage
│   ├── rootless
│   ├── shadow
│   └── sync
├── os
├── present
├── pseudoramiX
├── randr
├── record
├── render
├── test
│   ├── bigreq
│   ├── bugs
│   ├── damage
│   ├── scripts
│   ├── sync
│   ├── xi1
│   └── xi2
├── Xext
├── xfixes
├── Xi
└── xkb
The git repo produces several X servers, including the one designed to run on bare metal: Xorg (in hw/xfree86 for historical reasons). The other hw directories are the other X servers including Xwayland. All the other directories are core X server functionality that's shared between all X servers [1]. Removing Xorg from a distro but keeping Xwayland means building with --disable-xfree86 --enable-xwayland [1]. That's it (plus the resulting distro packaging work of course).

Removing Xorg means you need something else that runs on bare metal and that is your favourite Wayland compositor. Xwayland then talks to that while presenting an X11-compatible socket to existing X11 applications.

Of course all this means that the X server repo will continue to see patches and many of those will also affect Xorg, at least for those who are running git master. Don't get your hopes up for more Xorg releases beyond the security update background noise [2].

Xwayland on the other hand is actively maintained and will continue to see releases. But those releases are a sequence [1] of

$ git switch -c xwayland-23.x.y
$ git rm -r hw/{kdrive,vfb,xfree86,xnest,xquartz,xwin}
$ git tag xwayland-23.x.y
In other words, an Xwayland release is the xserver git master branch with all X servers but Xwayland removed. That's how Xwayland can see new updates and releases without Xorg ever seeing those (except on git master of course). And that's how your installed Xwayland has code from 2023 while your installed Xorg is still stuck on the branch created and barely updated after 2021.

I hope this helps a bit with the confusion of the seemingly mixed messages sent when you see headlines like "Xorg is unmaintained", "X server patches to fix blah", "Xorg is abandoned", "new Xwayland release".

[1] not 100% accurate but close enough
[2] historically an Xorg release included all other X servers (Xquartz, Xwin, Xvfb, ...) too so this applies to those servers too unless they adopt the Xwayland release model

December 13, 2023

A self-help guide for examining and debugging the AMD display driver within the Linux kernel/DRM subsystem.

These tips are based on my experience as an external developer working on the driver, and are shared with the goal of helping others navigate the driver code.

Acknowledgments: These tips were gathered thanks to the extensive help received from AMD developers during the driver development process. The list below was obtained by examining open source code, reviewing public documentation, playing with tools, asking in public forums and also with the help of my former GSoC mentor, Rodrigo Siqueira.

Pre-Debugging Steps:

Before diving into an issue, it’s crucial to perform two essential steps:

1) Check the latest changes: Ensure you’re working with the latest AMD driver modifications located in the amd-staging-drm-next branch maintained by Alex Deucher. You may also find bug fixes for newer kernel versions on branches that have the name pattern drm-fixes-<date>.

2) Examine the issue tracker: Confirm that your issue isn’t already documented and addressed in the AMD display driver issue tracker. If you find a similar issue, you can team up with others and speed up the debugging process.

Understanding the issue:

Do you really need to change this? Where should you start looking for changes?

3) Is the issue in the AMD kernel driver or in the userspace?: Identifying the source of the issue is essential regardless of the GPU vendor. Sometimes this can be challenging so here are some helpful tips:

  • Record the screen: Capture the screen using a recording app while experiencing the issue. If the bug appears in the capture, it’s likely a userspace issue, not the kernel display driver.
  • Analyze the dmesg log: Look for error messages related to the display driver in the dmesg log. If the error message appears before the message “[drm] Display Core v...”, it’s not likely a display driver issue. If this message doesn’t appear in your log, the display driver wasn’t fully loaded and you will see a notification that something went wrong here.
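
The dmesg check above can be automated with a few lines of Python. This is a rough sketch: the "error or fail" pattern is a naive assumption about what an error line looks like, the sample log lines marked as made-up are not real driver messages, and the function name is hypothetical.

```python
import re

DC_INIT = re.compile(r"\[drm\] Display Core v")
ERROR = re.compile(r"error|fail", re.IGNORECASE)


def classify_errors(dmesg_lines):
    """Split error-looking lines by whether they precede the DC init message.

    Returns (dc_loaded, before, after).
    """
    before, after = [], []
    dc_loaded = False
    for line in dmesg_lines:
        if DC_INIT.search(line):
            dc_loaded = True
            continue
        if ERROR.search(line):
            (after if dc_loaded else before).append(line)
    return dc_loaded, before, after


sample_log = [
    "[    1.10] example: firmware load failed",  # made-up line
    "[    2.50] [drm] Display Core v3.2.237 initialized on DCN 3.0.1",
    "[    3.00] example: error programming something",  # made-up line
]
dc_loaded, before, after = classify_errors(sample_log)
```

Lines collected in `before` are unlikely to come from the display driver, and `dc_loaded` being false means the display driver never finished loading.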

4) AMD Display Manager vs. AMD Display Core: The AMD display driver consists of two components:

  • Display Manager (DM): This component interacts directly with the Linux DRM infrastructure. Occasionally, issues can arise from misinterpretations of DRM properties or features. If the issue doesn’t occur on other platforms with the same AMD hardware - for example, only happens on Linux but not on Windows - it’s more likely related to the AMD DM code.
  • Display Core (DC): This is the platform-agnostic part responsible for setting and programming hardware features. Modifications to the DC usually require validation on other platforms, like Windows, to avoid regressions.

5) Identify the DC HW family: Each AMD GPU has variations in its hardware architecture. Features and helpers differ between families, so determining the relevant code for your specific hardware is crucial.

  • Find GPU product information in Linux/AMD GPU documentation
  • Check the dmesg log for the Display Core version and hardware family (reported since this commit in Linux kernel v6.3). For example:
    • [drm] Display Core v3.2.241 initialized on DCN 2.1
    • [drm] Display Core v3.2.237 initialized on DCN 3.0.1
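
As a small illustration, the banner lines shown above can be parsed to pull out both the DC release and the hardware family. This is a sketch in Python; the regular expression is derived only from the two example lines, so other kernels may print something it does not match, and the function name is an assumption.

```python
import re

# Pattern inferred from the two example lines above (an assumption, not a spec).
DC_LINE = re.compile(
    r"\[drm\] Display Core v(?P<dc_ver>\d+(?:\.\d+)*) initialized on "
    r"(?P<family>\S+) (?P<hw_ver>\d+(?:\.\d+)*)"
)


def parse_display_core(line):
    """Return (dc_version, family, hw_version) or None if the line does not match."""
    m = DC_LINE.search(line)
    if m is None:
        return None
    return m.group("dc_ver"), m.group("family"), m.group("hw_ver")


info = parse_display_core("[drm] Display Core v3.2.237 initialized on DCN 3.0.1")
print(info)  # ('3.2.237', 'DCN', '3.0.1')
```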

Investigating the relevant driver code:

Keep unrelated driver code from affecting your investigation.

6) Narrow the code inspection down to one DC HW family: the relevant code resides in a directory named after the DC number. For example, the DCN 3.0.1 driver code is located at drivers/gpu/drm/amd/display/dc/dcn301. AMD’s shared code is huge, and you can use these boundaries to rule out code unrelated to your issue.
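
A tiny sketch of that naming rule, assuming it generalizes from the single example above (DCN 3.0.1 maps to dcn301) and that the directory layout matches the kernel tree discussed in this post; other kernel versions may organize these files differently. The helper name is hypothetical.

```python
DC_ROOT = "drivers/gpu/drm/amd/display/dc"


def dcn_directory(hw_version, root=DC_ROOT):
    """Map a DCN version string such as '3.0.1' to its assumed source directory."""
    digits = hw_version.replace(".", "")
    if not digits.isdigit():
        raise ValueError(f"unexpected DCN version: {hw_version!r}")
    return f"{root}/dcn{digits}"


print(dcn_directory("3.0.1"))  # drivers/gpu/drm/amd/display/dc/dcn301
```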

7) Newer families may inherit code from older ones: you can find dcn301 using code from dcn30, dcn20, dcn10 files. It’s crucial to verify which hooks and helpers your driver utilizes to investigate the right portion. You can leverage ftrace for supplemental validation. To give an example, it was useful when I was updating DCN3 color mapping to correctly use their new post-blending color capabilities, such as:

Additionally, you can use two different HW families to compare behaviours. If you see the issue in one but not in the other, you can compare the code to understand what has changed and whether the implementation from a previous family doesn’t fit the new HW resources or design well. You can also count on the help of the community on the Linux AMD issue tracker to validate your code on other hardware and/or systems.
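
To see which families' code actually runs on your hardware, a captured ftrace function trace can be summarized by function-name prefix. The sketch below is Python and assumes the usual function-tracer line shape (`task-pid [cpu] flags timestamp: function <-caller`). The function names in the sample are made up for illustration and are not real driver symbols.

```python
import re
from collections import Counter

# Traced function name in a function-tracer line: "...: func <-caller".
TRACED_FUNC = re.compile(r":\s+(?P<func>\w+)\s+<-")
# Family prefix such as dcn10_, dcn20_, dcn30_, dcn301_.
FAMILY_PREFIX = re.compile(r"^(dcn\d+)_")


def families_hit(trace_text):
    """Count traced functions per dcnXX prefix."""
    counts = Counter()
    for line in trace_text.splitlines():
        m = TRACED_FUNC.search(line)
        if not m:
            continue
        fam = FAMILY_PREFIX.match(m.group("func"))
        if fam:
            counts[fam.group(1)] += 1
    return counts


# Hypothetical trace excerpt; the symbols are placeholders.
sample_trace = """\
   kworker-123   [001] .....    42.000001: dcn30_example_hook <-dcn301_example_caller
   kworker-123   [001] .....    42.000002: dcn20_example_helper <-dcn30_example_hook
   kworker-123   [001] .....    42.000003: dcn30_other_example <-dcn301_example_caller
"""
hits = families_hit(sample_trace)
print(sorted(hits.items()))  # [('dcn20', 1), ('dcn30', 2)]
```

A summary like this makes it easier to decide which older-family files are worth reading before diving into the dcn301 directory itself.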

This approach helped me debug a 2-year-old issue where the cursor gamma adjustment was incorrect on DCN3 hardware, but worked correctly for the DCN2 family. I solved the issue in two steps, thanks to community feedback and validation:

8) Check the hardware capability screening in the driver: You can currently find a list of display hardware capabilities in the drivers/gpu/drm/amd/display/dc/dcn*/dcn*_resource.c file. More precisely in the dcn*_resource_construct() function. Using DCN301 for illustration, here is the list of its hardware caps:

	/*************************************************
	 *  Resource + asic cap harcoding                *
	 *************************************************/
	pool->base.underlay_pipe_index = NO_UNDERLAY_PIPE;
	pool->base.pipe_count = pool->base.res_cap->num_timing_generator;
	pool->base.mpcc_count = pool->base.res_cap->num_timing_generator;
	dc->caps.max_downscale_ratio = 600;
	dc->caps.i2c_speed_in_khz = 100;
	dc->caps.i2c_speed_in_khz_hdcp = 5; /*1.4 w/a enabled by default*/
	dc->caps.max_cursor_size = 256;
	dc->caps.min_horizontal_blanking_period = 80;
	dc->caps.dmdata_alloc_size = 2048;
	dc->caps.max_slave_planes = 2;
	dc->caps.max_slave_yuv_planes = 2;
	dc->caps.max_slave_rgb_planes = 2;
	dc->caps.is_apu = true;
	dc->caps.post_blend_color_processing = true;
	dc->caps.force_dp_tps4_for_cp2520 = true;
	dc->caps.extended_aux_timeout_support = true;
	dc->caps.dmcub_support = true;

	/* Color pipeline capabilities */
	dc->caps.color.dpp.dcn_arch = 1;
	dc->caps.color.dpp.input_lut_shared = 0;
	dc->caps.color.dpp.icsc = 1;
	dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
	dc->caps.color.dpp.dgam_rom_caps.srgb = 1;
	dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1;
	dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1;
	dc->caps.color.dpp.dgam_rom_caps.pq = 1;
	dc->caps.color.dpp.dgam_rom_caps.hlg = 1;
	dc->caps.color.dpp.post_csc = 1;
	dc->caps.color.dpp.gamma_corr = 1;
	dc->caps.color.dpp.dgam_rom_for_yuv = 0;

	dc->caps.color.dpp.hw_3d_lut = 1;
	dc->caps.color.dpp.ogam_ram = 1;
	// no OGAM ROM on DCN301
	dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
	dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.dpp.ogam_rom_caps.pq = 0;
	dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
	dc->caps.color.dpp.ocsc = 0;

	dc->caps.color.mpc.gamut_remap = 1;
	dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; //2
	dc->caps.color.mpc.ogam_ram = 1;
	dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
	dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
	dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
	dc->caps.color.mpc.ogam_rom_caps.pq = 0;
	dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
	dc->caps.color.mpc.ocsc = 1;

	dc->caps.dp_hdmi21_pcon_support = true;

	/* read VBIOS LTTPR caps */
	if (ctx->dc_bios->funcs->get_lttpr_caps) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_lttpr_enable = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, &is_vbios_lttpr_enable);
		dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) && !!is_vbios_lttpr_enable;
	}

	if (ctx->dc_bios->funcs->get_lttpr_interop) {
		enum bp_result bp_query_result;
		uint8_t is_vbios_interop_enabled = 0;

		bp_query_result = ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, &is_vbios_interop_enabled);
		dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) && !!is_vbios_interop_enabled;
	}

Keep in mind that the documentation of color capabilities is available in the Linux kernel documentation.
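
When comparing two hardware families, it can help to extract these `dc->caps.*` assignments and diff them. Below is a rough Python sketch that works on the text of a resource file. The regular expression only handles simple one-line assignments like those in the listing above, and the second snippet (`other_family`) uses made-up values purely to show the diff.

```python
import re

# Matches simple one-line assignments such as "dc->caps.color.dpp.icsc = 1;".
CAP_ASSIGN = re.compile(r"dc->caps\.(?P<name>[\w.]+)\s*=\s*(?P<value>[^;]+);")


def parse_caps(c_source):
    """Return {capability name: value as written in the C source}."""
    return {m.group("name"): m.group("value").strip()
            for m in CAP_ASSIGN.finditer(c_source)}


def diff_caps(a, b):
    """Return {name: (value_in_a, value_in_b)} for capabilities that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Lines taken from the DCN301 listing above.
dcn301 = """
	dc->caps.max_cursor_size = 256;
	dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
	dc->caps.color.dpp.ogam_ram = 1;
"""
# Hypothetical values for some other family, for illustration only.
other_family = """
	dc->caps.max_cursor_size = 256;
	dc->caps.color.dpp.dgam_ram = 1;
"""
delta = diff_caps(parse_caps(dcn301), parse_caps(other_family))
print(delta)
```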

Understanding the development history:

What has brought us to the current state?

9) Pinpoint relevant commits: Use git log and git blame to identify commits targeting the code section you’re interested in.

10) Track regressions: If you’re examining the amd-staging-drm-next branch, check for regressions between DC release versions. These are defined by DC_VER in the drivers/gpu/drm/amd/display/dc/dc.h file. Alternatively, find a commit whose subject follows the format drm/amd/display: 3.2.221, which marks a display release. It’s useful for bisecting. This information helps you understand how outdated your branch is and identify potential regressions. As a rule of thumb, DC_VER is bumped about once a week. Finally, check the testing log of each release in the report provided on the amd-gfx mailing list, such as this one Tested-by: Daniel Wheeler:
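
As a sketch of how the release commits can serve as bisection landmarks, the Python below filters one-line log output (for example from `git log --oneline`) for subjects of the form `drm/amd/display: 3.2.221`, and turns the "about one bump per week" rule of thumb into a rough age estimate. The commit hashes in the sample are placeholders, and the helper names are assumptions.

```python
import re

# Subject format of a DC release commit, as described above.
DC_RELEASE = re.compile(r"drm/amd/display: (\d+)\.(\d+)\.(\d+)$")


def dc_release_commits(oneline_log):
    """Return [(sha, (major, minor, patch))] for DC version-bump commits."""
    releases = []
    for line in oneline_log.splitlines():
        sha, _, subject = line.partition(" ")
        m = DC_RELEASE.match(subject.strip())
        if m:
            releases.append((sha, tuple(int(x) for x in m.groups())))
    return releases


def approx_weeks_between(old, new):
    """Rough estimate, assuming one DC_VER patch bump per week."""
    old_v = tuple(int(x) for x in old.split("."))
    new_v = tuple(int(x) for x in new.split("."))
    if old_v[:2] != new_v[:2]:
        raise ValueError("only patch-level differences are estimated here")
    return new_v[2] - old_v[2]


# Placeholder hashes; only the subject format comes from the text above.
sample_log = """\
3333333 drm/amd/display: 3.2.222
2222222 drm/amd/display: hypothetical unrelated change
1111111 drm/amd/display: 3.2.221
"""
releases = dc_release_commits(sample_log)
print(releases)
print(approx_weeks_between("3.2.221", "3.2.241"))  # 20
```

The returned hashes are natural good/bad candidates when starting a `git bisect` between two DC releases.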

Reducing the inspection area:

Focus on what really matters.

11) Identify involved HW blocks: This helps isolate the issue. You can find more information about DCN HW blocks in the DCN Overview documentation. In summary:

  • Plane issues are closer to HUBP and DPP.
  • Blending/Stream issues are closer to MPC, OPP and OPTC. They are related to DRM CRTC subjects.

This information was useful when debugging a hardware rotation issue where the cursor plane got clipped off in the middle of the screen.

Finally, the issue was addressed by two patches:

12) Issues around bandwidth (glitches) and clocks: These may be affected by calculations done in these HW blocks and by HW-specific values. The recalculation equations are found in the DML folder. DML stands for Display Mode Library. It’s a math library in charge of all required configuration parameters supported by the hardware for multiple scenarios, and it configures the hardware to find the best balance between power efficiency and performance in a given scenario. See more in the AMD DC Overview kernel docs.

Finding some clk variables that affect device behavior may be a sign of it. It’s hard for an external developer to debug this part, since it involves information from HW specs and firmware programming that we don’t have access to. The best option is to provide all relevant debugging information you have and ask AMD developers to check the values you suspect.

  • Try this trick: If you suspect the power setup is degrading performance, try forcing the GPU to its highest performance level and see if it affects the system behavior with this command: sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
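
When experimenting with this knob, it is easy to forget to put the original setting back. The following Python sketch wraps the same sysfs write in a context manager that restores the previous value afterwards. It assumes the GPU is `card0` as in the command above and that the script runs as root on real hardware. The demo at the end runs against a temporary file, so it can be tried without touching the system.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Same sysfs file as in the command above; the card index may differ per system.
DPM_LEVEL = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")


@contextmanager
def forced_performance_level(level="high", path=DPM_LEVEL):
    """Temporarily write `level`, then restore whatever was there before."""
    path = Path(path)
    previous = path.read_text().strip()
    path.write_text(level + "\n")
    try:
        yield previous
    finally:
        path.write_text(previous + "\n")


# Dry run against a temporary file standing in for the sysfs entry.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "power_dpm_force_performance_level"
    fake.write_text("auto\n")
    with forced_performance_level("high", fake) as previous:
        during = fake.read_text().strip()
    after = fake.read_text().strip()
print(previous, during, after)  # auto high auto
```

On a real system, the body of the `with` block is where you would reproduce the glitch and observe whether it goes away.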

I learned it when debugging glitches with hardware cursor rotation on Steam Deck. My first attempt was changing the clock calculation. In the end, Rodrigo Siqueira proposed the right solution targeting bandwidth in two steps:

Checking implicit programming and hardware limitations:

Bring implicit programming to the level of consciousness and recognize hardware limitations.

13) Implicit update types: Check if the selected type for atomic update may affect your issue. The update type depends on the mode settings, since programming some modes demands more time for hardware processing. More details in the source code:

/* Surface update type is used by dc_update_surfaces_and_stream
 * The update type is determined at the very beginning of the function based
 * on parameters passed in and decides how much programming (or updating) is
 * going to be done during the call.
 *
 * UPDATE_TYPE_FAST is used for really fast updates that do not require much
 * logical calculations or hardware register programming. This update MUST be
 * ISR safe on windows. Currently fast update will only be used to flip surface
 * address.
 *
 * UPDATE_TYPE_MED is used for slower updates which require significant hw
 * re-programming however do not affect bandwidth consumption or clock
 * requirements. At present, this is the level at which front end updates
 * that do not require us to run bw_calcs happen. These are in/out transfer func
 * updates, viewport offset changes, recout size changes and pixel depth changes.
 * This update can be done at ISR, but we want to minimize how often this happens.
 *
 * UPDATE_TYPE_FULL is slow. Really slow. This requires us to recalculate our
 * bandwidth and clocks, possibly rearrange some pipes and reprogram anything front
 * end related. Any time viewport dimensions, recout dimensions, scaling ratios or
 * gamma need to be adjusted or pipe needs to be turned on (or disconnected) we do
 * a full update. This cannot be done at ISR level and should be a rare event.
 * Unless someone is stress testing mpo enter/exit, playing with colour or adjusting
 * underscan we don't expect to see this call at all.
 */

enum surface_update_type {
	UPDATE_TYPE_FAST, /* super fast, safe to execute in isr */
	UPDATE_TYPE_MED,  /* ISR safe, most of programming needed, no bw/clk change*/
	UPDATE_TYPE_FULL, /* may need to shuffle resources */
};
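
To make the three levels easier to reason about, here is a toy Python model of the rule described in that comment: the most expensive change in a commit decides the update type. It is only a paraphrase of the comment for illustration. The change labels are invented names, and the real decision logic in the driver is more involved.

```python
from enum import IntEnum


class SurfaceUpdateType(IntEnum):
    FAST = 0  # flip surface address only
    MED = 1   # significant reprogramming, no bandwidth/clock change
    FULL = 2  # bandwidth and clocks recalculated, pipes may be rearranged


# Illustrative labels based on the wording of the kernel comment above.
FAST_CHANGES = {"surface_address"}
MED_CHANGES = {"transfer_func", "viewport_offset", "recout_size", "pixel_depth"}
FULL_CHANGES = {"viewport_dimensions", "recout_dimensions", "scaling_ratios",
                "gamma", "pipe_enable", "pipe_disconnect"}


def classify_update(changes):
    """Return the highest update type required by any change in the set."""
    level = SurfaceUpdateType.FAST
    for change in changes:
        if change in FULL_CHANGES:
            return SurfaceUpdateType.FULL
        if change in MED_CHANGES:
            level = SurfaceUpdateType.MED
        elif change not in FAST_CHANGES:
            raise ValueError(f"unknown change label: {change!r}")
    return level


print(classify_update({"surface_address"}).name)                 # FAST
print(classify_update({"surface_address", "pixel_depth"}).name)  # MED
print(classify_update({"pixel_depth", "gamma"}).name)            # FULL
```

When a glitch only shows up with certain property changes, thinking in these terms helps decide whether a full bandwidth recalculation is likely to be involved.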

Using tools:

Observe the current state, validate your findings, continue improvements.

14) Use AMD tools to check hardware state and driver programming: These tools help you understand your driver settings and check the behavior when changing those settings.

  • DC Visual confirmation: Check multiple planes and pipe split policy.

  • DTN logs: Check display hardware state, including rotation, size, format, underflow, blocks in use, color block values, etc.

  • UMR: Check ASIC info, register values, KMS state - links and elements (framebuffers, planes, CRTCs, connectors). Source: UMR project documentation

15) Use generic DRM/KMS tools:

  • IGT test tools: Use generic KMS tests or develop your own to isolate the issue in the kernel space. Compare results across different GPU vendors to understand their implementations and find potential solutions. AMD also has specific IGT tests for its GPUs that are expected to work without failures on any AMD GPU. You can check results of HW-specific tests using different display hardware families or you can compare expected differences between the generic workflow and AMD workflow.

  • drm_info: This tool summarizes the current state of a display driver (capabilities, properties and formats) per element of the DRM/KMS workflow. Output can be helpful when reporting bugs.

Don’t give up!

Debugging issues in the AMD display driver can be challenging, but by following these tips and leveraging available resources, you can significantly improve your chances of success.

Worth mentioning: This blog post builds upon my talk, “I’m not an AMD expert, but…” presented at the 2022 XDC. It shares guidelines that helped me debug AMD display issues as an external developer of the driver.

Open Source Display Driver: The Linux kernel/AMD display driver is open source, allowing you to actively contribute by addressing issues listed in the official tracker. Tackling existing issues or resolving your own can be a rewarding learning experience and a valuable contribution to the community. Additionally, the tracker serves as a valuable resource for finding similar bugs, troubleshooting tips, and suggestions from AMD developers. Finally, it’s a platform for seeking help when needed.

Remember, contributing to the open source community through issue resolution and collaboration is mutually beneficial for everyone involved.

December 08, 2023

2023-12-10 UPDATE: From Mastodon, arcepi suggested the instability problems that I described below and served as a motivation to try Far Cry 6 on Linux could be coming from having switched from NVIDIA to AMD without reinstalling Windows, because of leftover files from the NVIDIA drivers. This morning I reinstalled Windows to test this and, indeed, the Deathloop and Far Cry 6 crashes seem to be gone (yay!). That would have removed my original motivation to try to run the game on Linux, but it doesn’t take away the main points of the post. Do take into account that the instability doesn’t seem to exist anymore (and I hope this applies to more future titles I play) but it’s still the background story to explain why I decided to install Far Cry 6 on my Fedora 39 system, so the original post follows below.

If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

Figure 1. Steam running on Fedora Linux 39

As background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off. The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system.

In recent years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons:

  1. For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need.

  2. Also, I was running Slackware and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about.

  3. Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that.

So, what changed?

Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

Earlier this year I upgraded my PC and replaced an old Intel Haswell i7-4770k with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it.

However, given my recent professional background I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or other fronts whatsoever. Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because:

  1. AMD cards were, in general, performing better for the same price than NVIDIA cards, except for ray tracing.

  2. The RX 6700 non-XT was on sale.

  3. It had the same performance as a PS5 or so.

  4. It didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile up on those, but it’s true my own personal experience is having generally more crashes in games and having to face more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.

A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up with some more graphically demanding games that I didn’t feel comfortable with playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and I went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds in a loading screen, the game crashed and I was back to the desktop. “Oh, what a bad way to start!”, I thought, without knowing what lay ahead. I launched the game again, same thing.

Over the course of the 2 hours that followed, I tried everything:

  1. Launching the main game instead of the benchmark, just in case the bug only happened in the benchmark. Nope.

  2. Lowering quality and resolution.

  3. Disabling any advanced setting.

  4. Trying windowed mode, or borderless full screen.

  5. Vsync off or on.

  6. Disabling the overlays for Ubisoft, Steam, AMD.

  7. Rebooting multiple times.

  8. Uninstalling the drivers normally as well as using DDU and installing them again.

Same result every time. I also searched on the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people both using AMD and NVIDIA had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though, I wanted to play it.

The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem).

And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app. For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

Figure 3. Gnome Software showing Steam as an installed application

Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password).

My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it.

Context menu shown after right-clicking on Far Cry 6 on the Steam application, with the Properties option highlighted
Far Cry 6 Compatibility tab displaying the option to force the use of a specific Steam Play compatibility tool

Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

The benchmark runs perfectly, no graphical glitches, no stuttering, frame rates above 100FPS normally, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing.

As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

2023-12-10 UPDATE: From Mastodon, Jaco G and Berto Garcia tell me the game not stopping problem is present in all Ubisoft games and is directly related to the Ubisoft launcher. It keeps running after closing the game, which makes Steam think the game is still running. You can try to close it from the tray if you see the Ubisoft icon there and, if that fails, you can stop the game like I described above.

Then I remembered something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up. The number and timing of the crashes didn’t look constant, and lowering the graphics settings sometimes would allow me to play the game a bit longer, but in any case I wasn’t able to finish the game intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users, but also NVIDIA ones, so I had mentally classified that as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it, hoping to be able to play it in the future.

Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high quality open source Mesa drivers like RADV, I’m amazed the experience can be actually better than gaming natively on Windows. Think about that: the Windows game binary running natively on a DX12 or Vulkan official driver crashes more and doesn’t work as well as the game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me and it speaks wonders of the work Valve has been doing on Linux. Or it could also speak badly of AMD Windows drivers, or both.

Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.

December 06, 2023

During these last two weeks I have been working towards adding support for more operations and kinds of convolutions so we can run more interesting models. As a first target, I'm aiming at MobileDet, which, though a bit old by now (it was introduced in 2020), is still the state of the art in object detection on mobile and is used in, for example, Frigate NVR.

I haven't mentioned it in a few updates, but all this work keeps being sponsored by Libre Computer, who are aiming to be the first manufacturer of single board computers to provide accelerated machine learning with open source components. Check out Alta and Solitude for the first such boards in the market.

Upstreaming

Igalia's Christian Gmeiner has been giving me great feedback on the merge request, and as part of that I submitted a patch to the kernel to retrieve some parameters that are needed when programming the hardware and that are best not left hardcoded.

This means that upstreaming to Mesa loses some urgency, as we are going to have to wait anyway until the merge window for 6.8 opens, after 6.7 final is out.

Convolutions with 5x5 weights

Until now I had implemented support only for weights with dimensions 1x1 (aka pointwise convolutions) and 3x3 (the most common by far). Some of the convolutions in MobileDet use 5x5 weight tensors though, so I had to implement support for them. It was a matter of adding some extra complexity to the code that compresses the weight tensors in the format that the hardware expects.

I implemented this for all kinds of supported convolutions: depthwise, strided, with padding, etc.

Tensor addition

I observed that the vendor blob implements addition operations with convolution jobs, so I looked deeper and saw that it was implementing the addition of two input tensors by placing them as the two channels of a single tensor, then passing them through a 1x1 convolution with a specially crafted weight tensor and bias vector.

This is working with hardcoded values for some specific input image dimensions, but I still need to gather more data so I can come up with a generic expression.
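
The idea of expressing addition as a convolution can be shown with plain floating-point values. The Python sketch below stacks two tensors as two channels and applies a 1x1 convolution with weights `[1, 1]` and a zero bias. It deliberately ignores quantization, which is presumably where the specially crafted weights and bias on the real hardware come from, so it illustrates the principle rather than what the blob or the driver does.

```python
def conv1x1(inp, weights, bias):
    """1x1 convolution.

    inp:     [in_channels][height][width]
    weights: [out_channels][in_channels]
    bias:    [out_channels]
    """
    in_ch = len(inp)
    height = len(inp[0])
    width = len(inp[0][0])
    return [
        [
            [
                bias[o] + sum(weights[o][c] * inp[c][y][x] for c in range(in_ch))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weights))
    ]


def add_via_conv(a, b):
    """Add two single-channel tensors by treating them as two input channels."""
    stacked = [a, b]
    return conv1x1(stacked, weights=[[1, 1]], bias=[0])[0]


a = [[1, 2], [3, 4]]
b = [[10, 20], [30, 40]]
print(add_via_conv(a, b))  # [[11, 22], [33, 44]]
```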

Softmax pooling

One more missing operation commonly used in models for mobile is pooling, in its different kinds: average, max, etc.

The blob implements these operations on the programmable core, with CL-like kernels.

So I dusted off the work that I did in the first half of 2023 and added code to Teflon for passing these operations to the Gallium drivers. Then I added a new kind of operation to the ML backend in Etnaviv to make use of the programmable core.

Things work fine, even if for now I am storing the kernel machine code in a blob inside the C code. The next step will be to implement the kernel in NIR and generate the machine code using the existing compiler in Etnaviv.

With this piece of work, we are now able to use all the hardware units in the NPU, and even if the programmable core in this configuration is really underpowered, it will allow us to keep the model in memory close to the NPU, instead of having to ping-pong between the NPU and CPU domains.

A new test suite

With new operations and kinds of convolutions being added, I was starting to have trouble testing all the possible combinations in a practical way, as the test suite that I had was taking more than 20 minutes for a full run.

To get around that, I reimplemented the tests in C++ with GoogleTest, which is supported by Emma Anholt's deqp-runner and will allow me to run the tests in parallel, making full use of the CPU cores in the board.

That made a big difference, but with so many testing combinations being added (more than 3000 as of now), it was still not fast enough for me. So I remembered an approach that we were considering to speed up execution of Vulkan and OpenGL conformance tests: caching the golden images that are used to compare and check that the output from the hardware is correct.

With that, the bottleneck is the network, as I store the cache in NFS, and I can run the full test suite in less than 3 minutes.
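
The post does not describe how the cache is laid out, so the following is only a generic Python sketch of the approach: reference ("golden") outputs are stored on disk under a key derived from the test parameters, and the slow reference computation runs only on a cache miss. The function names, the JSON format and the parameter dictionary are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(params):
    """Stable key for a test configuration."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def golden_output(params, compute_reference, cache_dir):
    """Return the cached reference output, computing and storing it on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / (cache_key(params) + ".json")
    if entry.exists():
        return json.loads(entry.read_text())
    result = compute_reference(params)
    entry.write_text(json.dumps(result))
    return result


calls = []


def slow_reference(params):
    """Stand-in for an expensive reference implementation."""
    calls.append(params)
    return [params["kernel"] * params["channels"]]


with tempfile.TemporaryDirectory() as tmp:
    p = {"kernel": 5, "channels": 3}
    first = golden_output(p, slow_reference, tmp)
    second = golden_output(p, slow_reference, tmp)  # served from the cache
print(first == second, len(calls))  # True 1
```

With the cache directory on a shared filesystem, as in the NFS setup mentioned above, the cost of each lookup is dominated by reading the stored file.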

Except that I started finding some tests that were randomly failing, especially when the cache of test results had already been brought into the filesystem cache in the board. After a lot of scratching my head, I came to realize that the Etnaviv kernel driver was trying to submit up to 4 jobs at the same time to the hardware, if userspace was fast enough to enqueue that many jobs before the previous ones had finished.

There is a kernel module parameter to set the number of jobs that are submitted to the hardware at any given point, and setting that to 1 took me back to rock solid test results, which is an absolute need for keeping the driver author's sanity.

Next steps

I have quickly added support for a lot of new operations and parameter combinations and the code is not as clean as I would like, in part due to the need for some refactoring.

So in the coming days I will be investing some time in cleaning things up, and afterwards I will move on to more operations in MobileDet.


November 29, 2023

I have not been so active for a while with writing these Fedora Workstation updates and part of the reason was that I felt I was beginning to repeat myself a lot, which I partly felt was a side effect of writing them so often, but with some time now since my last update I felt that the time was ripe again. So what are some of the things we have been working on and what are our main targets going forward? This is not an exhaustive list, but hopefully these are items you find interesting. Apologies for weird sentences and potential spelling mistakes, but it ended up a long post and when you read your own words over for the Nth time you start going blind to issues :)

PipeWire

So let’s start with one of your favorite topics, PipeWire. As you probably know PipeWire 1.0 is now out and I feel it is a project we definitely succeeded with, so big kudos to Wim Taymans for leading this effort. I think the fact that we got both the creator of JACK, Paul Davis, and the creator of PulseAudio, Lennart Poettering, to endorse it means our goal of unifying the Linux audio landscape is being met. I include their endorsement comments from the PipeWire 1.0 release announcement here:

“PipeWire represents the next evolution of audio handling for Linux, taking
the best of both pro-audio (JACK) and desktop audio servers (PulseAudio) and
linking them into a single, seamless, powerful new system.”
– Paul Davis, JACK and Ardour author

“PipeWire is a worthy successor to PulseAudio, providing a feature set
closer to how modern audio hardware works, and with a security model
with today’s application concepts in mind. Version 1.0 marks a
major milestone in completing the adoption of PipeWire in the standard
set of Linux subsystems. Congratulations to the team!”
– Lennart Poettering, PulseAudio and systemd author

So for new readers, PipeWire is an audio and video server we created for Fedora Workstation to replace PulseAudio for consumer audio and JACK for pro-audio, and to add similar functionality for video to your Linux operating system. So instead of having to deal with two different sound server architectures, users now just have to deal with one, and at the same time they get the same advantages for video handling. Since PipeWire implements both the PulseAudio API and the JACK API it is a drop-in replacement for both of them without needing any changes to the audio applications built for those two sound servers. Wim Taymans, alongside the amazing community that has grown around the project, has been hard at work maturing PipeWire and adding any missing feature they could find that blocked anyone from moving to it from either PulseAudio or JACK. Wim’s personal focus recently has been on an IRQ-based ALSA driver for PipeWire to be able to provide 100% performance parity with the old JACK server. So while a lot of pro-audio users felt that PipeWire’s latency was already good enough, this work by Wim shaves off the last few milliseconds to reach the same level of latency as JACK itself had.

In parallel with the work on PipeWire the community and especially Collabora has been hard at work on the new 0.5 release of WirePlumber, the session manager which handles all policy issues for PipeWire. I know people often get a little confused about PipeWire vs WirePlumber, but think of it like this: PipeWire provides you the ability to output audio through a connected speaker, through a Bluetooth headset, through an HDMI connection and so on, but it doesn’t provide any ‘smarts’ for how that happens. The smarts are instead provided by WirePlumber, which contains policies to decide where to route your audio or video, either based on user choice or through preset policies making the right choices automatically, like moving the audio to your internal speaker if you disconnect your USB speaker. Anyway, WirePlumber 0.5 will be a major step forward for WirePlumber, moving from using Lua scripts for configuration to instead using JSON for configuration while retaining Lua for scripting. This has many advantages, but I point you to this excellent blog post by Collabora’s Ashok Sidipotu for the details. Ashok has further details about WirePlumber 0.5 that you can find here.
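
To make the split between mechanism and policy a bit more concrete, here is a toy sketch in Python of the kind of decision a session manager makes. It is purely illustrative and is not how WirePlumber is implemented or configured; the device names, the priority numbers and the pick_sink function are all made up for the example.

```python
from typing import Optional


def pick_sink(sinks: list, user_choice: Optional[str] = None) -> Optional[str]:
    """Choose an output: the user's pick if connected, else highest priority."""
    available = [s for s in sinks if s["connected"]]
    if not available:
        return None
    for sink in available:
        if sink["name"] == user_choice:
            return sink["name"]
    # Preset policy: fall back to the connected sink with the highest priority.
    return max(available, key=lambda s: s["priority"])["name"]


sinks = [
    {"name": "internal-speaker", "priority": 100, "connected": True},
    {"name": "usb-speaker", "priority": 200, "connected": True},
]
before = pick_sink(sinks)            # USB speaker wins while it is plugged in
sinks[1]["connected"] = False        # the user unplugs the USB speaker
after = pick_sink(sinks)             # audio moves to the internal speaker
forced = pick_sink(sinks, user_choice="internal-speaker")
```

The point is only that the routing decision lives in a separate policy layer, while the server underneath just moves the audio wherever it is told.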

With PipeWire 1.0 out the door I feel we are very close to reaching one of our initial goals with PipeWire, to remove the need for custom pro-audio distributions like Fedora JAM or Ubuntu Studio, and instead just let audio folks be able to use the same great Fedora Workstation as the rest of the world. With 1.0 done Wim plans next to look a bit at things like configuration tools and similar used by pro-audio folks and also dive into the Flatpak portal needs of pro-audio applications more, to ensure that Flatpaks + PipeWire is the future of pro-audio.

On the video handling side it’s been a little slow going since applications there need to be ported from relying directly on v4l. Jan Grulich has been working with our friends at Mozilla and Google to get PipeWire camera handling support into Firefox and Google Chrome. At the moment it looks like the Firefox support will land first, in fact Jan has set up a COPR that lets you try it out here. For tracking the upstream work in WebRTC to add PipeWire support Jan set up this tracker bug. Getting the web browsers to use PipeWire is important both to enable the advanced video routing capabilities of PipeWire and to give applications the ability to use libcamera, which is needed for new modern MIPI cameras to work properly under Linux.

Another important application to get PipeWire camera support into is OBS Studio, and the great thing is that community member Georges Stavracas is working on getting the PipeWire patches merged into OBS Studio, hopefully in time for their planned release early next year. You can track Georges’ work in this pull request.

For more information about PipeWire 1.0 I recommend our interview with Wim Taymans in Fedora Magazine and also the interview with Wim on Linux Unplugged podcast.

HDR
HDR, or High Dynamic Range, is another major effort for us. HDR is a technology I think many of you have become familiar with due to it becoming quite common in TVs these days. It basically provides for greatly increased color depth and luminance on your screen. This is a change that entails a lot of changes through the stack, because when you introduce it into an existing ecosystem like the Linux desktop you have to figure out how to combine both new HDR capable applications and content and old non-HDR applications and content. Sebastian Wick, Jonas Ådahl, Olivier Fourdan, Michel Daenzer and more on the team have been working with other members of the ecosystem from Intel, AMD, NVIDIA, Collabora and more to pick and define the standards and protocols needed in this space. A lot of design work was done early in the year, so we have been quite focused on implementation work across the drivers, Wayland, Mesa, GStreamer, Mutter, GTK+ and more. Some of the more basic scenarios, like running a fullscreen HDR application, are close to being ready, while we are still working hard on getting all the needed pieces together for the more complex scenarios like running SDR and HDR windows composited together on your desktop. So getting for instance full screen games to run in HDR mode with Steam should happen shortly, but the windowed support will probably land closer to summer next year.

Wayland remoting
One feature we have also been spending a lot of time on is enabling remote logins to a Wayland desktop. You have been able to share your screen under Wayland more or less from day one, but it required your desktop session to be already active. But let’s say you wanted to access your Wayland desktop running on a headless system: you have been out of luck so far and had to rely on the old X session instead. Putting in place all the pieces for this has been quite an undertaking, with work having been done on PipeWire, on Wayland portals, the GNOME remote desktop daemon, libei (the new input emulation library), gdm and more. The pieces needed are finally falling into place and we expect to have everything needed landed in time for GNOME 46. This support is currently done using a private GNOME API, but a vendor-neutral API is being worked on to replace it.

As a sidenote here, not directly related to desktop remoting: libei has also enabled us to bring xtest support to XWayland, which was important for various applications including Valve’s gamescope.

NVIDIA drivers
One area we keep investing in is improving the state of NVIDIA support on Linux. This comes partly in the form of being the main company backing the continued development of the Nouveau graphics driver. The challenge with Nouveau is that for the longest while it offered next to no hardware acceleration for 3D graphics. The reason for this was that the firmware that NVIDIA provided for Nouveau to use didn’t expose that functionality, and since recent generations of NVIDIA cards only work with firmware signed by NVIDIA this left us stuck. So Nouveau was a good tool for doing an initial install of a system, but if you were doing any kind of serious 3D acceleration, including playing games, then you would need to install the NVIDIA binary driver. In the last year that landscape has changed drastically, with the release of the new out-of-tree open source driver from NVIDIA. Alongside that driver a new firmware has also been made available, one that does provide full support for hardware acceleration.
Let me inject a quick explanation of out-of-tree versus in-tree drivers here. An in-tree driver is basically a kernel driver for a piece of hardware that has been merged into the official Linux kernel from Linus Torvalds and is thus being maintained as part of the official Linux kernel releases. This ensures that the driver integrates well with the rest of the Linux kernel and that it gets updated in sync with the rest of the Linux kernel. So Nouveau is an in-tree kernel driver which also integrates with the rest of the open source graphics stack, like Mesa. The new NVIDIA open source driver is an out-of-tree driver which ships as a separate source code release on its own schedule, but of course NVIDIA works to keep it working with the upstream kernel releases (which is a lot of work of course and thus considered a major downside to being an out-of-tree driver).

As of the time of writing this blog post NVIDIA’s out-of-tree kernel driver and firmware are still a work in progress for display use cases, but that is changing with NVIDIA exposing more and more display features in the driver (and the firmware) with each new release they do. But if you saw the original announcement of the new open source driver from NVIDIA and have been wondering why no distribution relies on it yet, this is why. So what does this mean for Nouveau? Well, our plan is to keep supporting Nouveau for the foreseeable future because it is an in-tree driver, which makes it a lot easier to ensure it keeps working with each new upstream kernel release.

At the same time the new firmware updates allow Nouveau to eventually offer performance levels competitive with the official out-of-tree driver, kind of how the open source AMD driver with Mesa offers comparable performance to the AMD binary GPU driver userspace. So Nouveau maintainer Ben Skeggs spent the last year working hard on refactoring Nouveau to work with the new firmware and we now have a new release of Nouveau out showing the fruits of that labor, enabling support for NVIDIA’s latest chipset. Over time we will have it cover more chipsets and expand Vulkan and OpenGL (using Zink) support to be a full-fledged accelerated graphics driver.
So some news here: Ben, after having worked tirelessly on keeping Nouveau afloat for so many years, decided he needed a change of pace and thus decided to leave software development behind for the time being. A big thank you to Ben from all of us at Red Hat and Fedora! The good news is that Danilo Krummrich will take over as the development lead, with Lyude Paul working specifically on the display side of the driver. We also expect to have other members of the team chipping in too. They will pick up Ben’s work and continue working with NVIDIA and the community on a bright future for Nouveau.

As I mentioned, though, the new open source driver from NVIDIA is still being matured for the display use case, and until it works fully as a display driver Nouveau will not be able to be a full alternative either, since they share the same firmware. So people will need to rely on the binary NVIDIA driver for some time still. One thing we are looking at and discussing there is whether there are ways for us to improve the experience of using that binary driver with Secure Boot enabled. At the moment that requires quite a bit of manual fiddling with tools like mokutil. We have some ideas on how to streamline that a bit, but it is a hard nut to crack due to a combination of policy issues, legal issues, security issues and hardware/UEFI bugs, so I am making no promises at this point, just a promise that it is something we are looking at.

Accessibility
Accessibility is an important feature for us in Fedora Workstation and thus we hired Lukáš Tyrychtr to focus on the issue. Lukáš has been working across the stack fixing issues blocking proper accessibility support in Fedora Workstation and has also participated in various accessibility related events. There is still a lot to do there, so I was very happy to hear recently that the GNOME Foundation got a million Euro sponsorship from the Sovereign Tech Fund to improve various things across the stack, especially accessibility. So the combination of Lukáš’s continued efforts and that new investment should make for a much improved accessibility experience in GNOME and in Fedora Workstation going forward.

GNOME Software
Another area that we keep investing in is improving GNOME Software, with Milan Crha working continuously on bugfixing and performance improvements. GNOME Software is actually a fairly complex piece of software as it has to be able to handle the installation and updating of RPMs, OSTree system images, Flatpaks, fonts and firmware for us, in addition to the formats it handles for other distributions. For some time it felt like GNOME Software was struggling with the load of all those different formats and use cases and was becoming both slow and prone to a lot of error messages. Milan has been spending a lot of time dealing with those issues one by one and also recently landed some major performance improvements, making the GNOME Software experience a lot better. One major change that Milan is working on, which I think we will be able to land in Fedora Workstation 40/41, is porting GNOME Software to use DNF5. The main improvement end users will probably notice is that it unifies the caches used by GNOME Software and by dnf on the command line, saving you storage space and also ensuring the two are fully in sync on which RPMs are installed/updated at any given time.

Fedora and Flatpaks

Flatpaks are another key element of our strategy for moving the Linux desktop forward, and as part of that we have now enabled all of Flathub to be available if you choose to enable 3rd party repositories when you install Fedora Workstation. This means that the huge universe of applications available on Flathub will be easy to install through GNOME Software alongside the content available in Fedora’s own repositories. That said, we have also spent time improving the ease of making Fedora Flatpaks. Owen Taylor jumped in and removed the dependency on a technology called ‘modularity’, which was initially introduced to Fedora to bring new features around having different types of content and to ease keeping containers up to date. Unfortunately it did not work out as intended and instead it became something that everyone just felt made things a lot more complicated, including building Flatpaks from Fedora content. With Owen’s updates building Flatpaks in Fedora has become a lot simpler, which should help energize the effort of building Flatpaks in Fedora.

Toolbx
As we continue marching towards a vision for Fedora Workstation to be a highly robust operating system, we keep evolving Toolbx, our tool for running your development environment(s) inside a container. It allows you to keep your host OS pristine and up to date, while at the same time using specific toolchains and tools inside the development container. This is a hard requirement for immutable operating systems such as Fedora Silverblue or Universal Blue, but it is also useful on operating systems like Fedora Workstation as a way to do development for other platforms, like for instance Red Hat Enterprise Linux.

A major focus for Toolbx since its inception has been to get it to a stage where it is robust and reliable. So for instance while we prototyped it as a shell script, today it is written in Go to be more maintainable and also to conform with the rest of the container ecosystem. A recent major step forward for getting that stability there is that starting with Fedora 39, the toolbox image is now a release-blocking deliverable. This means it is now built as part of the nightly compose and the whole Toolbx stack (i.e. the fedora-toolbox image and the toolbox RPM) is part of the release-blocking test criteria. This shows the level of importance we put on Toolbx as the future of Linux software development and its criticality to Fedora Workstation. Earlier, we built the fedora-toolbox image as a somewhat separate and standalone thing, and people interested in Toolbx would try to test and keep the whole thing working, as much as possible, on their own. This was becoming unmanageable because Toolbx integrates with many parts of the distribution, from Mutter (i.e. the Wayland and X sockets) to Kerberos to RPM (i.e. %_netsharedpath in /usr/lib/rpm/macros.d/macros.toolbox) to glibc locale definitions and translations. The list of things that could change elsewhere in Fedora, and end up breaking Toolbx, was growing too large for a small group of Toolbx contributors to keep track of.

With the next release we will also have built-in support for Arch Linux and Ubuntu through the --distro flag (already available in toolbox.git main), thanks again to the community contributors who worked with us on this, allowing us to widen the range of distros supported while keeping with our policy of reliability and dependability. And along the same theme of ensuring Toolbx is a tool developers can rely on, we have added lots and lots of new tests. We now have more than 280 tests that run on CentOS Stream 9, all supported Fedoras and Rawhide, and Ubuntu 22.04.

Another feature that Toolbx maintainer Debarshi Ray put a lot of effort into is setting up full RHEL containers in Toolbx on top of Fedora. Today, thanks to Debarshi’s work, you run subscription-manager register --username user@domain.name on the Fedora or RHEL host, and the container is automatically entitled to RHEL content. We are still looking at how we can provide a graphical interface for that process or at least how to polish up the CLI for doing subscription-manager register. If you are interested in this feature, Debarshi provides a full breakdown here.

Other nice-to-haves that have been added are support for enterprise FreeIPA set-ups, where the user logs into their machine through Kerberos, and support for automatically generated shell completions for Bash, fish and Z shell.

Flatpak and Foreman & Katello
For those out there using Foreman to manage your fleet of Linux installs we have some good news. We are in the process of implementing support for Flatpaks in these tools so that you can manage and deploy applications in the Flatpak format using them. This is still a work in progress, but the relevant commits are the Pulp commit “Support for Flatpak index endpoints” and the Katello commits “Reporting results of docker v2 repo discovery” and “Support Link header in docker v2 repo discovery”.

LVFS
Another effort that Fedora Workstation has brought to the world of Linux and that is very popular is the LVFS and fwupd firmware update repository and tools. Thanks to that effort we are soon going to pass one hundred million firmware updates on Linux devices! These firmware updates have helped resolve countless bugs and much improved security for Linux users.

But we are not slowing down. Richard Hughes worked with industry partners this year to define a Bill of Materials definition for firmware updates, allowing users to be better informed about what is included in their firmware updates.

We now support over 1400 different devices on the LVFS (covering 78 different protocols!), with over 8000 public firmware versions from over 150 OEMs and ODMs. We’ve now done over 100,000 static analysis tests on over 2,000,000 EFI binaries in the firmware capsules!

Some examples of recently added hardware:
* AMD dGPUs, Navi3x and above, AVer FONE540, Belkin Thunderbolt 4 Core Hub dock, CE-LINK TB4 Docks, CH347 SPI programmer, EPOS ADAPT 1×5, Fibocom FM101, Foxconn T99W373, SDX12, SDX55 and SDX6X devices, Genesys GL32XX SD readers, GL352350, GL3590, GL3525S and GL3525 USB hubs, Goodix Touch controllers, HP Rata/Remi BLE Mice, Intel USB-4 retimers, Jabra Evolve 65e/t and SE, Evolve2, Speak2 and Link devices, Logitech Huddle, Rally System and Tap devices, Luxshare Quad USB4 Dock, MediaTek DP AUX Scalers, Microsoft USB-C Travel Hub, More Logitech Unifying receivers, More PixartRF HPAC devices, More Synaptics Prometheus fingerprint readers, Nordic HID devices, nRF52 Desktop Keyboard, PixArt BLE HPAC OTA, Quectel EM160 and RM520, Some Western Digital eMMC devices, Star Labs StarBook Mk VIr2, Synaptics Triton devices, System76 Launch 3, Launch Heavy 3 and Thelio IO 2, TUXEDO InfinityBook Pro 13 v3, VIA VL122, VL817S, VL822T, VL830 and VL832, Wacom Cintiq Pro 27, DTH134 and DTC121, One 13 and One 12 Tablets

InputLeap on Wayland
One really interesting feature that landed for Fedora Workstation 39 was the support for InputLeap. It’s probably not on most people’s radar, but it’s an important feature for system administrators, developers and generally anyone with more than a single computer on their desk.

Historically, InputLeap is a fork of Barrier, which itself was a fork of Synergy. It allows you to share the same input devices (mouse, keyboard) across different computers (Linux, Windows, macOS) and to move the pointer between the screens of these computers seamlessly as if they were one.

InputLeap has a client/server architecture with the server running on the main host (the one with the keyboard and mouse connected) and multiple clients, the other machines sitting next to the server machine. That implies two things, the InputLeap daemon on the server must be able to “capture” all the input events to forward them to the remote clients when the pointer reaches the edge of the screen, and the InputLeap client must be able to “replay” those input events on the client host to make it as if the keyboard and mouse were connected directly to the (other) computer. Historically, that relied on X11 mechanisms and neither InputLeap (nor Barrier or even Synergy as a matter of fact) would work on Wayland.
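
The capture/replay split can be pictured with a toy model in Python. This is only a sketch of the idea and not InputLeap code: the Server class, the single client on the right edge and the event tuples are invented for the illustration, and real implementations also handle coming back from the client, multiple edges, keyboard events and so on.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server: delivers events locally until the pointer hits the edge."""
    screen_width: int
    right_client: str
    target: str = "local"
    forwarded: list = field(default_factory=list)

    def on_motion(self, x: int) -> str:
        # Reaching the right edge hands input over to the remote client.
        if self.target == "local" and x >= self.screen_width - 1:
            self.target = self.right_client
        if self.target != "local":
            # "Capture": the event is sent to the client to be replayed there.
            self.forwarded.append((self.target, x))
        return self.target


server = Server(screen_width=1920, right_client="laptop")
trace = [server.on_motion(x) for x in (100, 1500, 1919, 1919)]
```

The first two motions stay on the server machine; once the pointer reaches the last pixel column, every following event is forwarded to the client.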

This is one of the use cases that Peter Hutterer had in mind when he started libEI, a low-level library aimed at providing a separate communication channel for input emulation in Wayland compositors and clients (even though libEI is not strictly tied to Wayland). But libEI alone is far from being sufficient to implement InputLeap features. With Wayland we had the opportunity to make things more secure than X11 and to take advantage of the XDG portal mechanisms.

On the client side, for replaying input events, it’s similar to remote desktop but we needed to update the existing RemoteDesktop portal to pass the libEI socket. On the server side, it required a brand new portal for input capture. These also required their counterparts in the GNOME portal, for both RemoteDesktop and InputCapture [8], and of course, all that needs to be supported by the Wayland compositor, in the case of GNOME that’s mutter. That alone was a lot of work.

Yet, even with all that in place, those are just the basic requirements to support a Synergy/Barrier/InputLeap-like feature. The tools in question need to have support for the portal and libEI implemented to benefit from the mechanisms we’ve put in place and for the whole feature to work and be usable. So libportal was also updated to support the new portal features and a new “Wayland” backend alongside the X11, Windows and macOS backends was contributed to InputLeap.

The merge request in InputLeap was accepted very early, even before the libEI API was completely stabilized and before the rest of the stack was merged, which I believe was a courageous choice from Povilas (who maintains InputLeap) which helped reduce the time to have the feature actually working, considering the number of components and inter-dependencies involved. Of course, there are still features missing in the Wayland backend, like copy/pasting between hosts, but a clipboard interface was fairly recently added to the remote desktop portal and therefore could be used by InputLeap to implement that feature.

Fun fact, Xwayland also grew support for libEI, likewise using the remote desktop portal, and wires that to the XTEST extension on X11 that InputLeap’s X11 backend uses, so it might even be possible to use the X11 backend of InputLeap on the client side through Xwayland, but of course it’s better to use the Wayland backend on both the client and server sides.

InputLeap is a great example of collaboration between multiple parties upstream, including key contributions from us at Red Hat, to implement and contribute a feature that has been requested for years upstream.

Thank you to Olivier Fourdan, Debarshi Ray, Richard Hughes, Sebastian Wick and Jonas Ådahl for their contributions to this blog post.

November 17, 2023

Progress

 
This update's highlight is that last week I finally got the TP jobs working, which allows us to do the tensor manipulation in the HW, removing 18ms from the tensor preprocessing. We can currently use them for transposing tensors from the format that TensorFlow prefers to that which the HW expects and the other way around, and for lowering strided convolutions to regular ones.
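
For readers unfamiliar with what such a transposition involves, here is a minimal CPU-side sketch in plain Python. TensorFlow Lite stores activations in NHWC order; the layout the NPU actually expects is not spelled out here, so the NCHW target below is only an assumed stand-in, and the function name is hypothetical. The TP jobs do this kind of reordering in hardware rather than with loops like these.

```python
def nhwc_to_nchw(data: list, n: int, h: int, w: int, c: int) -> list:
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    out = [0] * (n * h * w * c)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    dst = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    out[dst] = data[src]
    return out


# One image, 1 row, 2 pixels, 3 channels: [p0c0, p0c1, p0c2, p1c0, p1c1, p1c2]
interleaved = [0, 1, 2, 3, 4, 5]
planar = nhwc_to_nchw(interleaved, n=1, h=1, w=2, c=3)
```

After the call the values are grouped per channel instead of per pixel, which is the kind of work that previously cost those 18 ms on the CPU.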
 
This makes our image classification benchmark twice as fast, as expected:

tomeu@arm-64:~/mesa$ ETNA_MESA_DEBUG=ml_msgs python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
Running the NN job took 13 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 15.650ms

60 FPS is already quite interesting for many use cases, but the proprietary driver is able to do the same at around 8 ms, so there is still plenty of room for improvements.
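
The arithmetic behind those numbers, as a quick Python check. It assumes the 8 ms figure for the proprietary driver is comparable to the total time printed by the benchmark above, which is a simplification.

```python
def fps(frame_time_ms: float) -> float:
    """Frames per second for a given per-frame time in milliseconds."""
    return 1000.0 / frame_time_ms


current = fps(15.650)            # total time reported by the run above
target = fps(8.0)                # approximate proprietary driver time
speedup_needed = 15.650 / 8.0    # how much faster we would need to get
```

That works out to roughly 64 FPS today against 125 FPS, so a bit under a 2x gap.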
 
Some preliminary testing indicates that enabling zero-run length compression in the weight buffers will make the biggest difference, so that is what I will be working on when I get back to performance work.

Additionally, I also got some experimental jobs running on the programmable core in this NPU, which will allow us to run more advanced models, which tend to use operations that the hardware couldn't have been designed for back then.

Upstreaming is going well, those interested can follow it here:
 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25714.
 

Next steps

 

These will be my priorities during the next couple of weeks, in order:

  1. Upstreaming
  2. Get the Mobilenet SSD V1 model running on the HW, for object detection
  3. Performance

November 15, 2023

Hi! This month I’ve started a new PotM called pyonji. It’s an easy-to-use replacement for the venerable git-send-email command. The goal is to make it less painful for a new contributor not familiar with the e-mail based patch submission to submit patches.

Users are expected to use the same workflow as GitHub, GitLab and friends when contributing: create a new branch and add commits there. Instead of pushing to a fork though, users simply invoke pyonji.

When run for the first time, pyonji will ask for your e-mail account details: e-mail address, password… and nothing else. The SMTP server hostname, port and other details are automatically detected (via multiple means: SRV records, Mozilla auto-configuration database, common subdomains, etc). Once the password is verified pyonji will store everything in the Git configuration (in the same fashion that git-send-email expects it).
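
To give an idea of what such auto-detection has to try, here is a small Python sketch that only lists the candidates for a given address, without doing any network lookups. It is not pyonji’s code, and the function name and the choice of subdomains are assumptions for illustration. The SRV names follow RFC 8314 and RFC 6186, and the URL is the public Mozilla ISPDB endpoint.

```python
def smtp_discovery_candidates(email: str) -> dict:
    """List the lookups a client could try to find the SMTP server."""
    domain = email.rsplit("@", 1)[1].lower()
    return {
        # SRV records: implicit TLS first (RFC 8314), then STARTTLS (RFC 6186).
        "srv_records": [
            f"_submissions._tcp.{domain}",
            f"_submission._tcp.{domain}",
        ],
        # Mozilla auto-configuration database (ISPDB).
        "autoconfig_urls": [
            f"https://autoconfig.thunderbird.net/v1.1/{domain}",
        ],
        # Last resort: guess commonly used host names (an assumed shortlist).
        "common_hosts": [f"smtp.{domain}", f"mail.{domain}"],
    }


candidates = smtp_discovery_candidates("alice@Example.org")
```

A real client would query these in order and stop at the first one that yields a server accepting the given credentials.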

Then pyonji will present a UI with a list of commits to be submitted for review. The user can tweak details such as the base branch, the mailing list address, the version of the patch, however that’s rarely needed: pyonji will find good defaults for these. The user can add a cover letter if desired with a longer description for the set of patches. Then the big blue “submit” button can be pressed to send the patches.

Unlike git-send-email, pyonji will remember for you what the last submitted version number was (and automatically increment it). pyonji will save the cover letter so that it’s not lost if the network is flaky and you don’t need to re-type it for the next submission. pyonji will not waste your time with uninteresting questions such as “which encoding should I use?”. pyonji will automatically include the base tree information in the patches so that any conflicts are more easily resolved by the reviewer.
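
For comparison, this is roughly the bookkeeping one has to do by hand with plain Git, sketched in Python. It is not how pyonji is implemented; the helper names are hypothetical, and only the git format-patch options used (-v, --base and --cover-letter) are real.

```python
from typing import Optional


def next_version(last_submitted: Optional[int]) -> int:
    """First submission is v1, every resend bumps the version by one."""
    return 1 if last_submitted is None else last_submitted + 1


def format_patch_args(base: str, version: int, cover_letter: bool) -> list:
    """Build a git format-patch command line for the commits on top of base."""
    args = ["git", "format-patch", f"--base={base}"]
    if version > 1:
        args.append(f"-v{version}")      # adds the [PATCH vN] subject prefix
    if cover_letter:
        args.append("--cover-letter")    # generates a 0000-cover-letter.patch
    args.append(f"{base}..HEAD")
    return args


version = next_version(2)
command = format_patch_args("origin/main", version, cover_letter=True)
```

The resulting list could be passed to subprocess.run(); remembering the previous version number and the cover letter text between runs is exactly the part that is easy to get wrong manually.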

Please try it and let me know how it goes! In particular, I’m wondering if the logic to auto-detect the e-mail server settings is robust enough, or if there are e-mail providers I don’t handle correctly yet.

There is still a lot to be done to improve pyonji. Setup is painful for GMail and Fastmail users because app passwords are required. I wanted to use OAuth to fix this but both of these providers heavily restrict how SMTP OAuth apps can be registered. Setup doesn’t work for ProtonMail users because the bridge uses a self-signed certificate; that can be fixed but setup will remain painful. I’d like to add UI to change the base branch, improve the heuristics to pick a good default for the base branch, add support for the MAINTAINERS file for easier contribution to big projects such as the kernel, add an easy way to mark a patch series as RFC, and probably a million other things.

Apart from pyonji, I’ve been working on some graphics-related stuff as always. We’re getting closer to the wlroots 0.17 release, fixing the few remaining blocking issues. A new API to clip surfaces with the scene-graph has been merged, many thanks to Alexander Orzechowski and Isaac Freund! I’ve fixed a Mesa regression introduced by a previous patch I’d reviewed related to EGL and split render/display SoCs (I hate these). And I’ve been discussing with other kernel developers a way to stop (ab)using KMS dumb buffers for split render/display SoCs (I swear I really hate these). We’re trying to come up with a solution which could in the long run also help with the Buffer Allocation Constraints Problem (see the XDC 2020 talk for more info).

I’ve written a few patches to add support for OAuth 2.0 refresh tokens to meta.sr.ht. If you’ve ever used an OAuth sr.ht app (like hottub or yojo to integrate builds.sr.ht with GitHub or Forgejo), you probably know that tokens expire after one year, and that you need to redo the setup step when that happens. This is annoying, and adding support for refresh tokens to meta.sr.ht and the OAuth apps should fix this.
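
For reference, exchanging a refresh token for a new access token is a plain form-encoded POST as described in RFC 6749 section 6. The Python sketch below only builds the request body. Passing the client credentials in the body rather than via HTTP Basic authentication is an assumption made for the example, and the token values are placeholders.

```python
from urllib.parse import parse_qs, urlencode


def refresh_request_body(refresh_token: str, client_id: str,
                         client_secret: str) -> str:
    """Form-encoded body for an OAuth 2.0 refresh token grant."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        # Assumption: credentials sent in the body instead of HTTP Basic auth.
        "client_id": client_id,
        "client_secret": client_secret,
    })


body = refresh_request_body("old-refresh-token", "my-client", "s3cret")
fields = parse_qs(body)
```

The app would POST this body to the authorization server’s token endpoint with the application/x-www-form-urlencoded content type and store the returned access token (and the new refresh token, if one is issued).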

Last, I’m now part of the FreeDesktop Code of Conduct team. This is not a technical role, but it’s very important to have folks doing this work. I’ve attended a Code of Conduct workshop to learn how to do it, that’s been pretty interesting and helpful. The workshop focused a lot more on trying to change people’s behavior, instead of bringing down the ban hammer.

That’s all for now, see you next month!

Introduction

We spent a whole week rewriting the website of nouveau, the open source driver for NVIDIA cards. It started as a one-person effort, but it led to a few people helping me out. We addressed several issues in the nouveau website and improved it a lot. The redesign is live on nouveau.freedesktop.org.

In this article, we’ll go over the problems with the old site and the work we’ve done to fix them.

Problems With Old Website

I’m going to use this archive as a reference for the old site.

The biggest problem with the old site was that the HTML and CSS were written 15 years ago and have never been updated since. So in 2023, we were relying on outdated HTML/CSS code. Obviously, this was no fun from a reader’s perspective. With the technical debt and lack of interest, we were suffering from several problems. The only good thing about the old site was that it didn’t use JavaScript, which I wanted to keep for the rewrite.

Fun fact: the template was so old that it could be built for browsers that don’t support HTML5!

Not Responsive

“Responsive design” in web design means making the website accessible on a variety of screen sizes. In practice, a website should adapt to work on mobile devices, tablets, and laptops/computer monitors.

In the case of the nouveau website, it didn’t support mobile screen sizes properly. Buttons were hard to tap and text was small. Here are some screenshots taken in Firefox on my Razer Phone 2:

Small buttons and text in the navigation bar that are difficult to read and tap.

Small text in a table that forces the reader to zoom in.

No Dark Style

Regardless of style preferences, a dark style/theme helps people who are sensitive to light and saves battery life on AMOLED displays. Dark styles are useful for those who absolutely need them.

No SEO

Search Engine Optimization (SEO) is the process of making a website more discoverable on search engines like Google. We use various elements such as title, description, icon, etc. to increase the ranking in search engines.

In the case of nouveau, there were no SEO efforts. If we look at the old nouveau homepage’s <head> element, we get the following:

<head>
<meta charset="utf-8">
<title>nouveau</title>
<link rel="stylesheet" href="style.css" type="text/css">
<link rel="stylesheet" href="xorg.css" type="text/css">
<link rel="stylesheet" href="local.css" type="text/css">
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="https://gitlab.freedesktop.org/nouveau/wiki/-/edit/main/sources/index.mdwn">
</head>

The only thing there was a title, which is, obviously, far from desirable. The rest were CSS stylesheets, wiki source link, and character set.
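To make that gap concrete, here is a small sketch that parses the old <head> shown above and reports which of the elements named earlier (title, description, icon) are present. The helper names and the choice of which tags to look for are mine, for illustration only. This is not a tool we used during the rewrite.

```python
from html.parser import HTMLParser

# The old homepage <head>, copied from the listing above.
OLD_HEAD = """<head>
<meta charset="utf-8">
<title>nouveau</title>
<link rel="stylesheet" href="style.css" type="text/css">
<link rel="stylesheet" href="xorg.css" type="text/css">
<link rel="stylesheet" href="local.css" type="text/css">
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="https://gitlab.freedesktop.org/nouveau/wiki/-/edit/main/sources/index.mdwn">
</head>"""


class HeadAudit(HTMLParser):
    """Collect the <title> text, <meta name=...> names and <link rel=...> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_names = set()
        self.link_rels = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta_names.add(attrs["name"])
        elif tag == "link" and attrs.get("rel"):
            self.link_rels.update(attrs["rel"].split())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_head(html):
    """Return the page title and the list of missing description/icon entries."""
    audit = HeadAudit()
    audit.feed(html)
    missing = []
    if "description" not in audit.meta_names:
        missing.append("description")
    if "icon" not in audit.link_rels:
        missing.append("icon")
    return audit.title, missing


title, missing = audit_head(OLD_HEAD)
print(title, missing)
```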

Readability Issues

One of the biggest problems with nouveau’s website (apart from the homepage) was the lack of a maximum width. Large paragraphs stretched across the screen, making them difficult to read.

Process of Rewriting

Before I started the redesign, I talked to Karol Herbst, one of the nouveau maintainers. He had been wanting to redesign the nouveau site for ages, so I asked myself, “How hard can it be?” Well… mistakes were made.

The first step was to look at the repository and learn about the tools freedesktop.org uses for their website. freedesktop.org uses ikiwiki to generate the wiki. Problem is: it’s slow and really annoying to work with. The first thing I did was create a Fedora toolbox container. I installed the ikiwiki package to generate the website locally.

The second step was to rewrite the CSS and HTML template. I took a look at page.tmpl — the boilerplate. While looking at it, I discovered another problem: the template is unreadable. So I worked on that as well.

I ported to modern HTML elements, like <nav> for the navigation bar, <main> for the main content, and <footer> for the footer.

The third step was to rewrite the CSS. In the <head> tag above, we can see that the site pulls CSS from many sources: style.css, xorg.css, and local.css. So what I did was to delete xorg.css and local.css, delete the contents of style.css, and rewrite it from scratch. I copied a few things from libadwaita, namely its buttons and colors.

And behold… merge request !29!

Despite the success of the rewrite, I ran into a few roadblocks. I couldn’t figure out how to adapt the freedesktop.org logo to the dark style. Luckily, my friend kramo helped me out by providing an SVG file of the logo that adapts to dark style, based on Wikipedia’s. They also adjusted the style of the website to make it look nicer.

I also couldn’t figure out what to do with the tables because the colors were low contrast. Also, the large table on the Feature Matrix page was limited in maximum width, which would make it uncomfortable on large monitors. Lea from Fyra Labs helped with the tables and fixed the problems. She also adjusted the style.

After that, the rewrite was mostly done. Some reviewers came along and suggested some changes. Karol wanted the rewrite so badly that he opened a poll asking if he should merge it. It was an overwhelming yes, so… it got merged!

Conclusion

As Karol puts it:

“check out the nouveau repo, then cry, then reconsider your life choices”

In all seriousness, I’ve had a great time working on it. While this is the nouveau site in particular, I plan to eventually rewrite the entire freedesktop.org site. However, I started with nouveau because it was hosted on GitLab. Meanwhile, other sites/pages are hosted on freedesktop.org’s cgit instance, which were largely inaccessible for me to contribute to.

Ideally, we’d like to move from ikiwiki to something more modern, like a framework or a better generator, but we’ll have to see who’s willing to work on it and maintain it.

November 11, 2023

Today, 12 years after the meeting where AppStream was first discussed and 11 years after I released a prototype implementation I am excited to announce AppStream 1.0! 🎉🎉🎊

Check it out on GitHub, or get the release tarball or read the documentation or release notes! 😁

Some nostalgic memories

I was not in the original AppStream meeting, since in 2011 I was extremely busy with finals preparations and ball organization in high school, but I still vividly remember sitting at school in the students’ lounge during a break and trying to catch the really choppy live stream from the meeting on my borrowed laptop (a futile exercise, I watched parts of the blurry recording later).

I was extremely passionate about getting software deployment to work better on Linux and to improve the overall user experience, and spent many hours on the PackageKit IRC channel discussing things with many amazing people like Richard Hughes, Daniel Nicoletti, Sebastian Heinlein and others.

At the time I was writing a software deployment tool called Listaller – this was before Linux containers were a thing, and building it was very tough due to technical and personal limitations (I had just learned C!). Then in university, when I intended to recreate this tool, but for real and better this time as a new project called Limba, I needed a way to provide metadata for it, and AppStream fit right in! Meanwhile, Richard Hughes was tackling the UI side of things while creating GNOME Software and needed a solution as well. So I implemented a prototype and together we pretty much reshaped the early specification from the original meeting into what would become modern AppStream.

Back then I saw AppStream as a necessary side-project for my actual project, and didn’t even consider myself the maintainer of it for quite a while (I hadn’t been at the meeting after all). All those years ago I had no idea that ultimately I was developing AppStream not for Limba, but for a new thing that would show up later, with an even more modern design called Flatpak. I also had no idea how incredibly complex AppStream would become and how many features it would have and how much more maintenance work it would be – and also not how ubiquitous it would become.

The modern Linux desktop uses AppStream everywhere now, it is supported by all major distributions, used by Flatpak for metadata, used for firmware metadata via Richard’s fwupd/LVFS, runs on every Steam Deck, can be found in cars and possibly many places I do not know yet.

What is new in 1.0?

API breaks

The most important thing that’s new with the 1.0 release is a bunch of incompatible changes. For the shared libraries, all deprecated API elements have been removed and a bunch of other changes have been made to improve the overall API and especially make it more binding-friendly. That doesn’t mean that the API is completely new and nothing looks like before though, when possible the previous API design was kept and some changes that would have been too disruptive have not been made. Regardless of that, you will have to port your AppStream-using applications. For some larger ones I already submitted patches to build with both AppStream versions, the 0.16.x stable series as well as 1.0+.

For the XML specification, some older compatibility for XML that had no or very few users has been removed as well. This affects for example release elements that reference downloadable data without an artifact block, which has not been supported for a while. For all of these, I checked to remove only things that had close to no users and that were a significant maintenance burden. So as a rule of thumb: If your XML validated with no warnings with the 0.16.x branch of AppStream, it will still be 100% valid with the 1.0 release.

Another notable change is that the generated output of AppStream 1.0 will always be 1.0 compliant, you can not make it generate data for versions below that (this greatly reduced the maintenance cost of the project).

Developer element

For a long time, you could set the developer name using the top-level developer_name tag. With AppStream 1.0, this is changed a bit. There is now a developer tag with a name child (that can be translated unless the translate="no" attribute is set on it). This allows future extensibility, and also allows setting a machine-readable id attribute in the developer element. This makes it easier for software centers to group software by developer, without having to use heuristics. If we decide to extend the developer information per-app in future, this is also now possible. Do not worry though, the developer_name tag is also still read, so there is no high pressure to update. The old 0.16.x stable series also has this feature backported, so it can be available everywhere. Check out the developer tag specification for more details.
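As a rough illustration of the two forms, the sketch below reads the developer from a metainfo snippet, preferring the new developer element and falling back to the legacy developer_name tag. It uses plain XML parsing from the Python standard library, not the AppStream library itself. The component IDs and names are made-up examples, and read_developer is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical metainfo snippets, new and legacy style.
NEW_STYLE = """<component type="desktop-application">
  <id>org.example.App</id>
  <developer id="example.org">
    <name>Example Project</name>
    <name xml:lang="de">Beispielprojekt</name>
  </developer>
</component>"""

OLD_STYLE = """<component type="desktop-application">
  <id>org.example.App</id>
  <developer_name>Example Project</developer_name>
</component>"""


def read_developer(component):
    """Return (developer id, untranslated name) for a component element."""
    dev = component.find("developer")
    if dev is not None:
        for name in dev.findall("name"):
            if XML_LANG not in name.attrib:  # skip translated variants
                return dev.get("id"), name.text
        return dev.get("id"), None
    # Legacy tag: there is no machine-readable id to group by.
    return None, component.findtext("developer_name")


print(read_developer(ET.fromstring(NEW_STYLE)))
print(read_developer(ET.fromstring(OLD_STYLE)))
```

With the id available, grouping by developer becomes a dictionary lookup instead of a comparison of free-form names.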

Scale factor for screenshots

Screenshot images can now have a scale attribute, to indicate an (integer) scaling factor to apply. This feature was a breaking change and therefore we could not have it for the longest time, but it is now available. Please wait a bit for AppStream 1.0 to become deployed more widespread though, as using it with older AppStream versions may lead to issues in some cases. Check out the screenshots tag specification for more details.

Screenshot environments

It is now possible to indicate the environment a screenshot was recorded in (GNOME, GNOME Dark, KDE Plasma, Windows, etc.) via an environment attribute on the respective screenshot tag. This was also a breaking change, so use it carefully for now! If projects want to, they can use this feature to supply dedicated screenshots depending on the environment the application page is displayed in. Check out the screenshots tag specification for more details.
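A software center could use the environment attribute roughly as sketched below: prefer a screenshot recorded in the environment the page is shown in, and fall back to the default one otherwise. The snippet, the URLs and the environment ID string "gnome:dark" are assumptions for illustration, so check the screenshots tag specification for the exact identifiers. pick_screenshot is a hypothetical helper, not AppStream API.

```python
import xml.etree.ElementTree as ET

# Hypothetical metainfo fragment using the environment and scale attributes.
SCREENSHOTS = """<screenshots>
  <screenshot type="default">
    <image type="source" width="1600" height="900">https://example.org/main.png</image>
  </screenshot>
  <screenshot environment="gnome:dark">
    <image type="source" scale="2" width="3200" height="1800">https://example.org/dark-2x.png</image>
  </screenshot>
</screenshots>"""


def pick_screenshot(root, environment):
    """Prefer a screenshot recorded in `environment`, else the default one."""
    shots = root.findall("screenshot")
    for shot in shots:
        if shot.get("environment") == environment:
            return shot
    for shot in shots:
        if shot.get("type") == "default":
            return shot
    return shots[0] if shots else None


root = ET.fromstring(SCREENSHOTS)
for env in ("gnome:dark", "plasma"):
    image = pick_screenshot(root, env).find("image")
    # A missing scale attribute is treated as a factor of 1 in this sketch.
    print(env, image.text, "scale", int(image.get("scale", "1")))
```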

References tag

This is a feature more important for the scientific community and scientific applications. Using the references tag, you can associate the AppStream component with a DOI (Digital object identifier) or provide a link to a CFF file to provide citation information. It also allows to link to other scientific registries. Check out the references tag specification for more details.

Release tags

Releases can have tags now, just like components. This is generally not a feature that I expect to be used much, but in certain instances it can become useful with a cooperating software center, for example to tag certain releases as long-term supported versions.

Multi-platform support

Thanks to the interest and work of many volunteers, AppStream (mostly) runs on FreeBSD now, a NetBSD port exists, support for macOS was written and a Windows port is on its way! Thank you to everyone working on this 🙂

Better compatibility checks

For a long time I thought that the AppStream library should just be a thin layer above the XML and that software centers should just implement a lot of the actual logic. This has not been the case for a while, but there were still a lot of complex AppStream features that were hard for software centers to implement and where it makes sense to have one implementation that projects can just use.

The validation of component relations is one such thing. This was implemented in 0.16.x as well, but 1.0 vastly improves upon the compatibility checks, so you can now just run as_component_check_relations and retrieve a detailed list of whether the current component will run well on the system. Besides better API for software developers, the appstreamcli utility also has much improved support for relation checks, and I wrote about these changes in a previous post. Check it out!

With these changes, I hope this feature will be used much more, and beyond just drivers and firmware.

So much more!

The changelog for the 1.0 release is huge, and there are many papercuts resolved and changes made that I did not talk about here, like us using gi-docgen (instead of gtkdoc) now for nice API documentation, or the many improvements that went into better binding support, or better search, or just plain bugfixes.

Outlook

I expect the transition to 1.0 to take a bit of time. AppStream has not broken its API for many, many years (since 2016), so a bunch of places need to be touched even if the changes themselves are minor in many cases. In hindsight, I should have also released 1.0 much sooner and it should not have become such a mega-release, but that was mainly due to time constraints.

So, what’s in it for the future? Contrary to what I thought, AppStream does not really seem to be “done” and feature complete at a point, there is always something to improve, and people come up with new use cases all the time. So, expect more of the same in future: Bugfixes, validator improvements, documentation improvements, better tools and the occasional new feature.

Onwards to 1.0.1! 😁

November 10, 2023

TLDR: see the title of this blog post, it's really that trivial.

Now that GodotWayland has been coming for ages and all new development focuses on a pile of software that steams significantly less, we're seeing cracks appear in the old Xorg support. Not intentionally, but there's only so much time that can be spent on testing and things that are more niche fall through. One of these was a bug I just had the pleasure of debugging and was triggered by a GNOME on Xorg user using the xf86-input-libinput driver for tablet devices.

On the surface of it, this should be fine because libinput (and thus xf86-input-libinput) handles tablets just fine. But libinput is the new kid on the block. The old kid on said block is the xf86-input-wacom driver, older than libinput by slightly over a decade. And oh man, history has baked things into the driver that are worse than raisins in apple strudel [1].

The xf86-input-libinput driver was written as a wrapper around libinput and makes use of fancy things that (from libinput's POV) have always been around: things like input device hotplugging. Fancy, I know. For tablet devices the driver creates an X device for each new tool as it comes into proximity first. Future events from that tool will go through that device. A second tool, be it a new pen or the eraser on the original pen, will create a second X device and events from that tool will go through that X device. Configuration on any device will thus only affect that particular pen. Almost like the whole thing makes sense.

The wacom driver of course doesn't do this. It pre-creates X devices for some possible types of tools (pen, eraser, and cursor [2] but not airbrush or artpen). When a tool goes into proximity the events are sent through the respective device, i.e. all pens go through the pen tool, all erasers through the eraser tool. To actually track pens there is the "Wacom Serial IDs" property that contains the current tool's serial number. If you want to track multiple tools you need to query the property on proximity in [4]. At the time this was within a reasonable error margin of a good idea.

Of course and because MOAR CONFIGURATION! will save us all from the great filter you can specify the "ToolSerials" xorg.conf option as e.g. "airbrush;12345;artpen" and get some extra X devices pre-created, in this case an airbrush and an artpen X device and an X device just for the tool with the serial number 12345. All other tools multiplex through the default devices. Again, at the time this was a great improvement. [5]

Anyway, where was I? Oh, right. The above should serve as a good approximation of a reason why the xf86-input-libinput driver does not try to be fully compatible with the xf86-input-wacom driver. In everyday use these things barely matter [6] but for the desktop environment which needs to configure these devices all these differences mean multiple code paths. Those paths need to be tested but they aren't, so things fall through the cracks.

So quite a while ago, we made the decision that until Xorg goes dodo, the xf86-input-wacom driver is the tablet driver to use in GNOME. So if you're using a GNOME on Xorg session [7], do make sure the xf86-input-wacom driver is installed. It will make both of us happier and that's a good aim to strive for.

[1] It's just a joke. Put the pitchforks down already.
[2] The cursor is the mouse-like thing Wacom sells. Which is called cursor [3] because the English language has a limited vocabulary and we need to re-use words as much as possible lest we run out of them.
[3] It's also called puck. Because [2].
[4] And by "query" I mean "wait for the XI2 event notifying you of a property change". Because of lolz the driver cannot update the property on proximity in but needs to schedule that as idle func so the property update for the serial always arrives at some unspecified time after the proximity in but hopefully before more motion events happen. Or not, and that's how hope dies.
[5] Think about this next time someone says they long for some unspecified good old days.
[6] Except the strip axis which on the wacom driver is actually a bit happily moving left/right as your finger moves up/down on the touch strip and any X client needs to know this. libinput normalizes this to...well, a normal value but now the X client needs to know which driver is running so, oh deary deary.
[7] e.g. because you're stockholmed into it by your graphics hardware

November 08, 2023

I’ve recently worked on a patch for the vc4 display driver used on the Raspberry Pi 4. To test this patch, I needed to compile the kernel and install it, something I know how to do on x86 but not on Raspberry Pi. Because I’m pretty stubborn I’ve also insisted on making my life harder:

  • I installed Arch Linux ARM as the base system, instead of Raspberry Pi OS or Raspbian.
  • I based my patches on top of the mainline kernel, instead of using Raspberry Pi’s tree.
  • I wanted to install my built kernel alongside the one provided by the distribution, instead of overwriting it.

Raspberry Pi has an official guide to compile the kernel, however it assumes Raspberry Pi OS, Raspberry Pi’s kernel tree, and overwrites the current kernel. It was still very useful to get an idea of the process. Still, quite a few adaptations have been required. This blog post serves as my personal notepad to remember how to Do It.

First, the official guide instructs us to run make bcm2711_defconfig to generate the kernel config, however mainline complains with:

Can't find default configuration "arch/arm/configs/bcm2711_defconfig"

This can be fixed by grabbing this file from the Raspberry Pi tree:

curl -L -o arch/arm/configs/bcm2711_defconfig "https://github.com/raspberrypi/linux/raw/rpi-6.1.y/arch/arm/configs/bcm2711_defconfig"

Once that’s done, compiling the kernel as usual works fine. Then we need to install it to the /boot partition. We can ignore the overlays stuff from the official guide, we don’t use these. The source paths need to be slightly adjusted, and the destination paths need to be fixed up to use a subdirectory:

doas make modules_install
doas cp arch/arm/boot/dts/broadcom/*.dtb /boot/custom/
doas cp arch/arm/boot/zImage /boot/custom/kernel7.img

Then we need to generate an initramfs. At first I forgot to do that step and the kernel was hanging around USB bus discovery.

doas mkinitcpio --generate /boot/custom/initramfs-linux.img --kernel /boot/custom/kernel7.img

The last step is updating the boot firmware configuration located at /boot/config.txt. Comment out any dtoverlay directive, then add os_prefix=custom/ to point the firmware to our subdirectory (note, the final slash is important).
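For future me, the config.txt edit described above can be sketched as a small text transformation. This is a minimal sketch under my own assumptions: use_custom_prefix is a hypothetical helper, the sample input is made up, and it only handles dtoverlay and os_prefix lines.

```python
def use_custom_prefix(config_text, prefix="custom/"):
    """Comment out dtoverlay lines and point the firmware at `prefix`."""
    if not prefix.endswith("/"):
        # For a subdirectory the trailing slash matters, as noted above.
        raise ValueError("prefix for a subdirectory must end with a slash")
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("dtoverlay"):
            out.append("#" + line)  # keep the directive, but disabled
        elif stripped.startswith("os_prefix="):
            continue  # drop any previous prefix, ours is appended below
        else:
            out.append(line)
    out.append("os_prefix=" + prefix)
    return "\n".join(out) + "\n"


# Made-up sample input, not my actual config.txt.
sample = "dtparam=audio=on\ndtoverlay=vc4-kms-v3d\n"
print(use_custom_prefix(sample), end="")
```

One caveat: if the file ends inside a conditional section such as [pi4], a line appended at the end is subject to that filter, so the new line may need to be placed by hand.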

For some reason my memory card was showing up as /dev/mmcblk1 instead of /dev/mmcblk0, so I had to bang my head against the wall until I noticed the difference and adjusted /boot/cmdline.txt and /etc/fstab accordingly.

That’s it! After a reboot I was ready to start kernel hacking. Thanks to Maíra Canal for replying to my distress signal on Mastodon and providing recommendations!

November 07, 2023

TL;DR:

This blog post explores the color capabilities of AMD hardware and how they are exposed to userspace through driver-specific properties. It discusses the different color blocks in the AMD Display Core Next (DCN) pipeline and their capabilities, such as predefined transfer functions, 1D and 3D lookup tables (LUTs), and color transformation matrices (CTMs). It also highlights the differences in AMD HW blocks for pre and post-blending adjustments, and how these differences are reflected in the available driver-specific properties.

Overall, this blog post provides a comprehensive overview of the color capabilities of AMD hardware and how they can be controlled by userspace applications through driver-specific properties. This information is valuable for anyone who wants to develop applications that can take advantage of the AMD color management pipeline.

Get a closer look at each hardware block’s capabilities, unlock a wealth of knowledge about AMD display hardware, and enhance your understanding of graphics and visual computing. Stay tuned for future developments as we embark on a quest for GPU color capabilities in the ever-evolving realm of rainbow treasures.


Operating Systems can use the power of GPUs to ensure consistent color reproduction across graphics devices. We can use GPU-accelerated color management to manage the diversity of color profiles, do color transformations to convert between High-Dynamic-Range (HDR) and Standard-Dynamic-Range (SDR) content and color enhancements for wide color gamut (WCG). However, to make use of GPU display capabilities, we need an interface between userspace and the kernel display drivers that is currently absent in the Linux/DRM KMS API.

In the previous blog post I presented how we are expanding the Linux/DRM color management API to expose specific properties of AMD hardware. Now, I’ll guide you to the color features for the Linux/AMD display driver. We embark on a journey through DRM/KMS, AMD Display Manager, and AMD Display Core and delve into the color blocks to uncover the secrets of color manipulation within AMD hardware. Here we’ll talk less about the color tools and more about where to find them in the hardware.

We resort to driver-specific properties to reach AMD hardware blocks with color capabilities. These blocks display features like predefined transfer functions, color transformation matrices, and 1-dimensional (1D LUT) and 3-dimensional lookup tables (3D LUT). Here, we will understand how these color features are strategically placed into color blocks both before and after blending in Display Pipe and Plane (DPP) and Multiple Pipe/Plane Combined (MPC) blocks.

That said, welcome back to the second part of our thrilling journey through AMD’s color management realm!

AMD Display Driver in the Linux/DRM Subsystem: The Journey

In my 2022 XDC talk “I’m not an AMD expert, but…”, I briefly explained the organizational structure of the Linux/AMD display driver where the driver code is bifurcated into a Linux-specific section and a shared-code portion. To reveal AMD’s color secrets through the Linux kernel DRM API, our journey led us through these layers of the Linux/AMD display driver’s software stack. It includes traversing the DRM/KMS framework, the AMD Display Manager (DM), and the AMD Display Core (DC) [1].

The DRM/KMS framework provides the atomic API for color management through KMS properties represented by struct drm_property. We extended the color management interface exposed to userspace by leveraging existing resources and connecting them with driver-specific functions for managing modeset properties.

On the AMD DC layer, the interface with hardware color blocks is established. The AMD DC layer contains OS-agnostic components that are shared across different platforms, making it an invaluable resource. This layer already implements hardware programming and resource management, simplifying the external developer’s task. While examining the DC code, we gain insights into the color pipeline and capabilities, even without direct access to specifications. Additionally, AMD developers provide essential support by answering queries and reviewing our work upstream.

The primary challenge involved identifying and understanding relevant AMD DC code to configure each color block in the color pipeline. However, the ultimate goal was to bridge the DC color capabilities with the DRM API. For this, we changed the AMD DM, the OS-dependent layer connecting the DC interface to the DRM/KMS framework. We defined and managed driver-specific color properties, facilitated the transport of user space data to the DC, and translated DRM features and settings to the DC interface. Considerations were also made for differences in the color pipeline based on hardware capabilities.

Exploring Color Capabilities of the AMD display hardware

Now, let’s dive into the exciting realm of AMD color capabilities, where an abundance of techniques and tools await to make your colors look extraordinary across diverse devices.

First, we need to know a little about the color transformation and calibration tools and techniques that you can find in different blocks of the AMD hardware. I borrowed some images from [2] [3] [4] to help you understand the information.

Predefined Transfer Functions (Named Fixed Curves):

Transfer functions serve as the bridge between the digital and visual worlds, defining the mathematical relationship between digital color values and linear scene/display values and ensuring consistent color reproduction across different devices and media. You can learn more about curves in the chapter GPU Gems 3 - The Importance of Being Linear by Larry Gritz and Eugene d’Eon.

ITU-R 2100 introduces three main types of transfer functions:

  • OETF: the opto-electronic transfer function, which converts linear scene light into the video signal, typically within a camera.
  • EOTF: electro-optical transfer function, which converts the video signal into the linear light output of the display.
  • OOTF: opto-optical transfer function, which has the role of applying the “rendering intent”.

AMD’s display driver supports the following pre-defined transfer functions (aka named fixed curves):

  • Linear/Unity: linear/identity relationship between pixel value and luminance value;
  • Gamma 2.2, Gamma 2.4, Gamma 2.6: pure power functions;
  • sRGB: the piece-wise transfer function from IEC 61966-2-1:1999, whose power segment uses a 2.4 exponent;
  • BT.709: has a linear segment in the bottom part and then a power function with a 0.45 (~1/2.22) gamma for the rest of the range; standardized by ITU-R BT.709-6;
  • PQ (Perceptual Quantizer): used for HDR display, allows luminance range capability of 0 to 10,000 nits; standardized by SMPTE ST 2084.

These capabilities vary depending on the hardware block, with some utilizing hardcoded curves and others relying on AMD’s color module to construct curves from standardized coefficients. It also supports user/custom curves built from a lookup table.
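To give a feel for what these named curves compute, here is a minimal sketch of the textbook formulas for a pure power function, the sRGB piece-wise curve, the BT.709 OETF and the PQ EOTF. These are the published formulas with values normalized to [0, 1] (PQ output in nits). They are not the driver's hardcoded tables or the AMD color module code, and the function names are mine.

```python
def gamma_eotf(v, gamma=2.2):
    """Pure power function: encoded value -> linear value."""
    return v ** gamma


def srgb_eotf(v):
    """sRGB piece-wise curve: linear segment near black, 2.4 power above."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4


def bt709_oetf(light):
    """BT.709 OETF: linear segment at the bottom, then a 0.45 power."""
    if light < 0.018:
        return 4.5 * light
    return 1.099 * light ** 0.45 - 0.099


def pq_eotf(v):
    """SMPTE ST 2084 (PQ) EOTF: encoded value in [0, 1] -> luminance in nits."""
    m1 = 2610 / 16384
    m2 = 2523 / 4096 * 128
    c1 = 3424 / 4096
    c2 = 2413 / 4096 * 32
    c3 = 2392 / 4096 * 32
    p = v ** (1 / m2)
    return 10000.0 * (max(p - c1, 0.0) / (c2 - c3 * p)) ** (1 / m1)


for v in (0.0, 0.5, 1.0):
    print(v, gamma_eotf(v), srgb_eotf(v), bt709_oetf(v), pq_eotf(v))
```

Note how the PQ curve is the only one whose output is an absolute luminance, topping out at the 10,000 nits mentioned above.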

1D LUTs (1-dimensional Lookup Table):

A 1D LUT is a versatile tool, defining a one-dimensional color transformation based on a single parameter. It’s very well explained by Jeremy Selan in GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations.

It enables adjustments to color, brightness, and contrast, making it ideal for fine-tuning. In the Linux AMD display driver, the atomic API offers a 1D LUT with 4096 entries and 8-bit depth, while legacy gamma uses a size of 256.
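As an illustration of the idea (not of the hardware implementation), the sketch below samples a 1D LUT with linear interpolation between neighbouring entries. The 256-entry table mirrors the legacy gamma size mentioned above. Filling it with a gamma 2.2 curve is just an example choice, and apply_lut1d is a hypothetical helper.

```python
def apply_lut1d(lut, x):
    """Sample a 1D LUT at x in [0, 1], interpolating linearly between entries."""
    x = min(max(x, 0.0), 1.0)          # clamp out-of-range input
    pos = x * (len(lut) - 1)           # fractional index into the table
    lo = int(pos)
    hi = min(lo + 1, len(lut) - 1)
    frac = pos - lo
    return lut[lo] * (1.0 - frac) + lut[hi] * frac


# Example table: 256 entries holding a gamma 2.2 curve.
lut = [(i / 255) ** 2.2 for i in range(256)]

# The same table is applied to each channel independently, which is why a
# 1D LUT cannot express interactions between channels.
pixel = (0.25, 0.5, 1.0)
print(tuple(apply_lut1d(lut, c) for c in pixel))
```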

3D LUTs (3-dimensional Lookup Table):

These tables work in three dimensions – red, green, and blue. They’re perfect for complex color transformations and adjustments between color channels. They’re also more complex to manage and require more computational resources. Jeremy also explains 3D LUTs in GPU Gems 2 - Chapter 24 Using Lookup Tables to Accelerate Color Transformations.
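The sketch below shows the principle with an identity table and trilinear interpolation. Trilinear is chosen here for simplicity and is an assumption of mine, not a statement about how the hardware interpolates. The helper names are hypothetical.

```python
def make_identity_lut3d(n):
    """Build an n x n x n table (n >= 2) where entry [r][g][b] maps to itself."""
    step = 1.0 / (n - 1)
    return [[[(r * step, g * step, b * step) for b in range(n)]
             for g in range(n)] for r in range(n)]


def lerp(a, b, t):
    """Linear interpolation between two RGB tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def apply_lut3d(lut, rgb):
    """Look up an (r, g, b) triple in [0, 1] using trilinear interpolation."""
    n = len(lut)
    idx, frac = [], []
    for c in rgb:
        pos = min(max(c, 0.0), 1.0) * (n - 1)
        i = min(int(pos), n - 2)       # keep i + 1 inside the table
        idx.append(i)
        frac.append(pos - i)
    r, g, b = idx
    fr, fg, fb = frac
    # Interpolate along blue, then green, then red.
    c00 = lerp(lut[r][g][b], lut[r][g][b + 1], fb)
    c01 = lerp(lut[r][g + 1][b], lut[r][g + 1][b + 1], fb)
    c10 = lerp(lut[r + 1][g][b], lut[r + 1][g][b + 1], fb)
    c11 = lerp(lut[r + 1][g + 1][b], lut[r + 1][g + 1][b + 1], fb)
    return lerp(lerp(c00, c01, fg), lerp(c10, c11, fg), fr)


lut3d = make_identity_lut3d(17)
print(apply_lut3d(lut3d, (0.2, 0.5, 0.9)))
```

Because each table entry is a full RGB triple indexed by all three input channels, the output red can depend on the input green and blue. That is what a set of 1D LUTs cannot do, and also why the table grows with the cube of its size.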

CTM (Color Transformation Matrices):

Color transformation matrices facilitate the transition between different color spaces, playing a crucial role in color space conversion.
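In practice a CTM is a 3x3 matrix multiplied with each linear RGB triple. The sketch below uses the BT.709 to BT.2020 primaries conversion as an example, with coefficients rounded to four decimals as published in ITU-R BT.2087. The matrix choice and the apply_ctm helper are mine, for illustration.

```python
# Row-major 3x3 matrix: linear BT.709 RGB -> linear BT.2020 RGB
# (coefficients rounded to four decimals).
BT709_TO_BT2020 = [
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
]


def apply_ctm(matrix, rgb):
    """Multiply a 3x3 matrix by an (r, g, b) column vector."""
    return tuple(
        sum(matrix[row][col] * rgb[col] for col in range(3))
        for row in range(3)
    )


# White stays white because every row sums to 1; a saturated BT.709 red
# lands inside the wider BT.2020 gamut.
print(apply_ctm(BT709_TO_BT2020, (1.0, 1.0, 1.0)))
print(apply_ctm(BT709_TO_BT2020, (1.0, 0.0, 0.0)))
```

The matrix only makes sense on linear values, which is why in a color pipeline it sits between a degamma step and a regamma step.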

HDR Multiplier:

HDR multiplier is a factor applied to the color values of an image to increase their overall brightness.

AMD Color Capabilities in the Hardware Pipeline

First, let’s take a closer look at the AMD Display Core Next hardware pipeline in the Linux kernel documentation for AMDGPU driver - Display Core Next

In the AMD Display Core Next hardware pipeline, we encounter two hardware blocks with color capabilities: the Display Pipe and Plane (DPP) and the Multiple Pipe/Plane Combined (MPC). The DPP handles color adjustments per plane before blending, while the MPC engages in post-blending color adjustments. In short, we expect DPP color capabilities to match up with DRM plane properties, and MPC color capabilities to play nice with DRM CRTC properties.

Note: here’s the catch – there are some DRM CRTC color transformations that don’t have a corresponding AMD MPC color block, and vice versa. It’s like a puzzle, and we’re here to solve it!

AMD Color Blocks and Capabilities

We can finally talk about the color capabilities of each AMD color block. As it varies based on the generation of hardware, let’s take the DCN3+ family as reference. What’s possible to do before and after blending depends on hardware capabilities described in the kernel driver by struct dpp_color_caps and struct mpc_color_caps.

The AMD Steam Deck hardware provides a tangible example of these capabilities. Therefore, we take the SteamDeck/DCN301 driver as an example and look at the “Color pipeline capabilities” described in the file: drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c

/* Color pipeline capabilities */

dc->caps.color.dpp.dcn_arch = 1; // If it is a Display Core Next (DCN): yes. Zero means DCE.
dc->caps.color.dpp.input_lut_shared = 0;
dc->caps.color.dpp.icsc = 1; // Input Color Space Conversion (CSC) matrix.
dc->caps.color.dpp.dgam_ram = 0; // The old degamma block for degamma curve (hardcoded and LUT). `Gamma correction` is the new one.
dc->caps.color.dpp.dgam_rom_caps.srgb = 1; // sRGB hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1; // BT2020 hardcoded curve support (seems not actually in use)
dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1; // Gamma 2.2 hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.pq = 1; // PQ hardcoded curve support
dc->caps.color.dpp.dgam_rom_caps.hlg = 1; // HLG hardcoded curve support
dc->caps.color.dpp.post_csc = 1; // CSC matrix
dc->caps.color.dpp.gamma_corr = 1; // New `Gamma Correction` block for degamma user LUT;
dc->caps.color.dpp.dgam_rom_for_yuv = 0;

dc->caps.color.dpp.hw_3d_lut = 1; // 3D LUT support. If so, it's always preceded by a shaper curve. 
dc->caps.color.dpp.ogam_ram = 1; // `Blend Gamma` block for custom curve just after blending
// no OGAM ROM on DCN301
dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.dpp.ogam_rom_caps.pq = 0;
dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
dc->caps.color.dpp.ocsc = 0;

dc->caps.color.mpc.gamut_remap = 1; // Post-blending CTM (pre-blending CTM is always supported)
dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; // Post-blending 3D LUT (preceded by shaper curve)
dc->caps.color.mpc.ogam_ram = 1; // Post-blending regamma.
// No pre-defined TF supported for regamma.
dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
dc->caps.color.mpc.ogam_rom_caps.pq = 0;
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1; // Output CSC matrix.

I included some inline comments in each element of the color caps to quickly describe them, but you can find the same information in the Linux kernel documentation. See more in struct dpp_color_caps, struct mpc_color_caps and struct rom_curve_caps.

Now, using this guideline, we go through color capabilities of DPP and MPC blocks and talk more about mapping driver-specific properties to corresponding color blocks.

DPP Color Pipeline: Before Blending (Per Plane)

Let’s explore the capabilities of DPP blocks and what you can achieve with each color block. The very first thing to pay attention to is the architecture of the display hardware: previously AMD used a display architecture called DCE (Display and Compositing Engine), but newer hardware follows DCN (Display Core Next).

The architecture is described by: dc->caps.color.dpp.dcn_arch

AMD Plane Degamma: TF and 1D LUT

Described by: dc->caps.color.dpp.dgam_ram, dc->caps.color.dpp.dgam_rom_caps, dc->caps.color.dpp.gamma_corr

AMD Plane Degamma data is mapped to the initial stage of the DPP pipeline. It is used to transition from scanout/encoded values to linear values for arithmetic operations. Plane Degamma supports both pre-defined transfer functions and 1D LUTs, depending on the hardware generation. DCN2 and older families handle both types of curve in the Degamma RAM block (dc->caps.color.dpp.dgam_ram); DCN3+ separates hardcoded curves and the 1D LUT into two blocks: Degamma ROM (dc->caps.color.dpp.dgam_rom_caps) and the Gamma correction block (dc->caps.color.dpp.gamma_corr), respectively.

Pre-defined transfer functions:

  • they are hardcoded curves (read-only memory - ROM);
  • supported curves: sRGB EOTF, BT.709 inverse OETF, PQ EOTF and HLG OETF, Gamma 2.2, Gamma 2.4 and Gamma 2.6 EOTF.
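
To make the curve names a bit more concrete, here is a minimal Python sketch of two of the listed curves, the sRGB EOTF and the PQ EOTF, written from their published definitions (IEC 61966-2-1 and SMPTE ST 2084). It only illustrates what "going from encoded to linear values" means; the function names are made up for this example, and this is not how the hardware ROM or the AMD color module implements the curves.

```python
def srgb_eotf(encoded):
    """Decode an sRGB-encoded value in [0, 1] to linear light in [0, 1]."""
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4


def srgb_inverse_eotf(linear):
    """Encode linear light in [0, 1] back to an sRGB value in [0, 1]."""
    if linear <= 0.0031308:
        return linear * 12.92
    return 1.055 * linear ** (1 / 2.4) - 0.055


# SMPTE ST 2084 (PQ) constants.
PQ_M1 = 2610 / 16384
PQ_M2 = 2523 / 4096 * 128
PQ_C1 = 3424 / 4096
PQ_C2 = 2413 / 4096 * 32
PQ_C3 = 2392 / 4096 * 32


def pq_eotf(encoded):
    """Decode a PQ value in [0, 1] to linear light, where 1.0 stands for 10000 cd/m2."""
    p = encoded ** (1 / PQ_M2)
    return (max(p - PQ_C1, 0.0) / (PQ_C2 - PQ_C3 * p)) ** (1 / PQ_M1)
```

The inverse function is the kind of curve the later delinearizing stages deal with: it goes from linear values back to encoded values.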

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. Setting TF = Identity/Default and LUT as NULL means bypass.
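
As a rough illustration of what "an array of struct drm_color_lut elements" looks like from userspace, the Python sketch below packs an identity ramp into a byte blob. In the DRM uAPI each struct drm_color_lut element carries four unsigned 16-bit fields (red, green, blue, reserved); the helper name and the way the ramp is quantized are assumptions made for this example only.

```python
import struct

LUT_ENTRIES = 4096  # entry count quoted in the text above


def identity_lut_blob(entries=LUT_ENTRIES):
    """Pack an identity ramp as consecutive drm_color_lut elements."""
    blob = bytearray()
    for i in range(entries):
        value = round(i * 0xFFFF / (entries - 1))
        # struct drm_color_lut { __u16 red; __u16 green; __u16 blue; __u16 reserved; }
        blob += struct.pack("=HHHH", value, value, value, 0)
    return bytes(blob)
```

A blob like this is what would be handed to the kernel as the property value; an identity ramp should leave colors untouched, which makes it a handy sanity check.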


AMD Plane 3x4 CTM (Color Transformation Matrix)

AMD Plane CTM data goes to the DPP Gamut Remap block, supporting a 3x4 fixed point (s31.32) matrix for color space conversions. The data is interpreted as a struct drm_color_ctm_3x4. Setting NULL means bypass.
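
For readers who have not met the fixed-point format before, here is a small Python sketch that encodes matrix coefficients as S31.32 values. The DRM CTM structures store each coefficient in sign-magnitude form (one sign bit, 31 integer bits, 32 fractional bits), not two's complement. The helper names are hypothetical, and the row-major layout with the fourth column acting as an offset is my assumption for this illustration.

```python
def to_s31_32(value):
    """Encode a float as an S31.32 sign-magnitude 64-bit integer."""
    magnitude = round(abs(value) * (1 << 32))
    if magnitude >= 1 << 63:
        raise OverflowError("value does not fit in S31.32")
    # Sign-magnitude: bit 63 is the sign, the low 63 bits hold |value|.
    sign = (1 << 63) if (value < 0 and magnitude != 0) else 0
    return sign | magnitude


def identity_ctm_3x4():
    """Return 12 coefficients: a 3x3 identity plus a zero offset column."""
    rows = [
        (1.0, 0.0, 0.0, 0.0),
        (0.0, 1.0, 0.0, 0.0),
        (0.0, 0.0, 1.0, 0.0),
    ]
    return [to_s31_32(c) for row in rows for c in row]
```

With this encoding 1.0 becomes 1 << 32, and a negative coefficient only differs from its positive counterpart in the top bit.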


AMD Plane Shaper: TF + 1D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The Shaper block fine-tunes color adjustments before applying the 3D LUT, optimizing the use of the limited entries in each dimension of the 3D LUT. On AMD hardware, a 3D LUT always means a preceding shaper 1D LUT used for delinearizing and/or normalizing the color space before applying a 3D LUT, so this entry on DPP color caps dc->caps.color.dpp.hw_3d_lut means support for both shaper 1D LUT and 3D LUT.

A pre-defined transfer function enables delinearizing content with or without a shaper LUT, where the AMD color module calculates the resulting shaper curve. Shaper curves go from linear values to encoded values. If we are already in a non-linear space and/or don’t need to normalize values, we can set an Identity TF for the shaper, which works similarly to bypass and is also the default TF value.

Pre-defined transfer functions:

  • there is no DPP Shaper ROM. Curves are calculated by AMD color modules. Check calculate_curve() function in the file amd/display/modules/color/color_gamma.c.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting Plane Shaper TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT as NULL works as bypass.


AMD Plane 3D LUT

Described by: dc->caps.color.dpp.hw_3d_lut

The 3D LUT in the DPP block facilitates complex color transformations and adjustments. A 3D LUT is a three-dimensional array where each element is an RGB triplet. As mentioned before, dc->caps.color.dpp.hw_3d_lut describes whether the DPP 3D LUT is supported.

The AMD driver-specific property advertises the size of a single dimension via the LUT3D_SIZE property. Plane 3D LUT is a blob property where the data is interpreted as an array of struct drm_color_lut elements and the number of entries is LUT3D_SIZE cubed. The array contains samples from the approximated function. Values between samples are estimated by tetrahedral interpolation. The array is accessed with three indices, one for each input dimension (color channel), blue being the outermost dimension and red the innermost. This distribution is better visualized when examining the code in [RFC PATCH 5/5] drm/amd/display: Fill 3D LUT from userspace by Alex Hung:

+	for (nib = 0; nib < 17; nib++) {
+		for (nig = 0; nig < 17; nig++) {
+			for (nir = 0; nir < 17; nir++) {
+				ind_lut = 3 * (nib + 17*nig + 289*nir);
+
+				rgb_area[ind].red = rgb_lib[ind_lut + 0];
+				rgb_area[ind].green = rgb_lib[ind_lut + 1];
+				rgb_area[ind].blue = rgb_lib[ind_lut + 2];
+				ind++;
+			}
+		}
+	}
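
The loop from the patch can be transliterated into a few lines of Python to see which source triplet ends up in which output slot. This is only a sketch that mirrors the excerpt above: the function name and the size parameter are mine, and the source is modelled as a flat list of channel values, three per LUT point.

```python
def reorder_3dlut(rgb_lib, size=17):
    """Mirror the nested loop of the patch: blue outermost, red innermost."""
    rgb_area = []
    for nib in range(size):
        for nig in range(size):
            for nir in range(size):
                # Same index computation as the patch (17 and 289 are size and size**2).
                ind_lut = 3 * (nib + size * nig + size * size * nir)
                rgb_area.append(
                    (rgb_lib[ind_lut], rgb_lib[ind_lut + 1], rgb_lib[ind_lut + 2])
                )
    return rgb_area
```

Running it on a tiny 2x2x2 table filled with consecutive numbers makes the reordering easy to follow by hand, and a 17-size table yields the 4913 entries you would expect.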

In our driver-specific approach we opted to advertise its behavior to userspace instead of implicitly dealing with it in the kernel driver. AMD’s hardware supports 3D LUTs with 17-size or 9-size (4913 and 729 entries respectively), and you can choose between 10-bit or 12-bit. In the current driver-specific work we focus on enabling only the 17-size 12-bit 3D LUT, as in [PATCH v3 25/32] drm/amd/display: add plane 3D LUT support:

+		/* Stride and bit depth are not programmable by API yet.
+		 * Therefore, only supports 17x17x17 3D LUT (12-bit).
+		 */
+		lut->lut_3d.use_tetrahedral_9 = false;
+		lut->lut_3d.use_12bits = true;
+		lut->state.bits.initialized = 1;
+		__drm_3dlut_to_dc_3dlut(drm_lut, drm_lut3d_size, &lut->lut_3d,
+					lut->lut_3d.use_tetrahedral_9,
+					MAX_COLOR_3DLUT_BITDEPTH);

A refined control of 3D LUT parameters should go through a follow-up version or generic API.

Setting 3D LUT to NULL means bypass.


AMD Plane Blend/Out Gamma: TF + 1D LUT

Described by: dc->caps.color.dpp.ogam_ram

The Blend/Out Gamma block applies the final touch-up before blending, allowing users to linearize content after the 3D LUT and just before the blending. It supports both a 1D LUT and pre-defined TFs. We can see the Shaper and Blend LUTs as 1D LUTs that sandwich the 3D LUT. So, if we don’t need 3D LUT transformations, we may want to use only the Degamma block to linearize and skip Shaper, 3D LUT and Blend.

Pre-defined transfer function:

  • there is no DPP Blend ROM. Curves are calculated by AMD color modules;
  • supported curves: Identity, sRGB EOTF, BT.709 inverse OETF, PQ EOTF, HLG inverse OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. If plane_blend_tf_property != Identity TF, AMD color module will combine the user LUT values with pre-defined TF into the LUT parameters to be programmed. Setting TF = Identity/Default and LUT to NULL means bypass.


MPC Color Pipeline: After Blending (Per CRTC)

DRM CRTC Degamma 1D LUT

This is the degamma lookup table (LUT) for converting framebuffer pixel data before applying the color conversion matrix. The data is interpreted as an array of struct drm_color_lut elements. Setting NULL means bypass.

Not really supported. The driver currently reuses the DPP degamma LUT block (dc->caps.color.dpp.dgam_ram and dc->caps.color.dpp.gamma_corr) to support the DRM CRTC Degamma LUT, as explained in [PATCH v3 20/32] drm/amd/display: reject atomic commit if setting both plane and CRTC degamma.

DRM CRTC 3x3 CTM

Described by: dc->caps.color.mpc.gamut_remap

It sets the current transformation matrix (CTM) applied to pixel data after the lookup through the degamma LUT and before the lookup through the gamma LUT. The data is interpreted as a struct drm_color_ctm. Setting NULL means bypass.

DRM CRTC Gamma 1D LUT + AMD CRTC Gamma TF

Described by: dc->caps.color.mpc.ogam_ram

After all that, you might still want to convert the content to wire encoding. No worries, in addition to the DRM CRTC 1D LUT, we’ve got an AMD CRTC gamma transfer function (TF) to make it happen. Possible TF values are defined by enum amdgpu_transfer_function.

Pre-defined transfer functions:

  • there is no MPC Gamma ROM. Curves are calculated by AMD color modules.
  • supported curves: Identity, sRGB inverse EOTF, BT.709 OETF, PQ inverse EOTF, HLG OETF, and Gamma 2.2, Gamma 2.4, Gamma 2.6 inverse EOTF.

The 1D LUT currently accepts 4096 entries of 8-bit. The data is interpreted as an array of struct drm_color_lut elements. When setting CRTC Gamma TF (!= Identity) and LUT at the same time, the color module will combine the pre-defined TF and the custom LUT values into the LUT that’s actually programmed. Setting TF = Identity/Default and LUT to NULL means bypass.


Others

AMD CRTC Shaper and 3D LUT

We have previously worked on exposing CRTC shaper and CRTC 3D LUT, but they were removed from the AMD driver-specific color series because they lack a userspace use case. CRTC shaper and 3D LUT work similarly to plane shaper and 3D LUT, but after blending (MPC block). The difference here is that setting (not bypassing) the Shaper and Gamma blocks together is not expected, since both blocks are used to delinearize the input space. In summary, we either set Shaper + 3D LUT or Gamma.

Input and Output Color Space Conversion

There are two other color capabilities of AMD display hardware that were integrated into DRM by previous work and are worth a brief explanation here. The DC Input CSC sets pre-defined coefficients from the values of the DRM plane color_range and color_encoding properties. It is used for color space conversion of the input content. On the other hand, the DC Output CSC (OCSC) sets pre-defined coefficients from the DRM connector colorspace property. It is used for color space conversion of the composed image to the one supported by the sink.
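
To give an idea of what such a color space conversion computes, here is a textbook YCbCr to RGB conversion in Python. The luma coefficients play the role of color_encoding (BT.709 by default here) and the limited flag plays the role of color_range. This is generic math for 8-bit input written for illustration, not the coefficient tables used by DC, and the function name is hypothetical.

```python
def ycbcr_to_rgb(y, cb, cr, kr=0.2126, kb=0.0722, limited=True):
    """Convert one 8-bit YCbCr sample to normalized RGB (BT.709 by default)."""
    kg = 1 - kr - kb
    if limited:
        # Limited range: luma in 16..235, chroma in 16..240.
        yn = (y - 16) / 219
        pb = (cb - 128) / 224
        pr = (cr - 128) / 224
    else:
        # Full range: luma and chroma use 0..255.
        yn = y / 255
        pb = (cb - 128) / 255
        pr = (cr - 128) / 255
    r = yn + 2 * (1 - kr) * pr
    b = yn + 2 * (1 - kb) * pb
    g = (yn - kr * r - kb * b) / kg
    return r, g, b
```

Limited-range white (235, 128, 128) maps to RGB (1, 1, 1) and limited-range black (16, 128, 128) maps to (0, 0, 0), which is a quick way to check that range and encoding were picked consistently.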


The search for rainbow treasures is not over yet

If you want to understand a little more about this work, be sure to watch the two talks that Joshua and I presented at XDC 2023 about AMD/Steam Deck colors on Gamescope.

In the time between the first and second part of this blog post, Uma Shashank and Chaitanya Kumar Borah published the plane color pipeline for Intel and Harry Wentland implemented a generic API for DRM based on VKMS support. We discussed these two proposals and the next steps for Color on Linux during the Color Management workshop at XDC 2023 and I briefly shared workshop results in the 2023 XDC lightning talk session.

The search for rainbow treasures is not over yet! We plan to meet again next year in the 2024 Display Hackfest in Coruña-Spain (Igalia’s HQ) to keep up the pace and continue advancing today’s display needs on Linux.

Finally, a HUGE thank you to everyone who worked with me on exploring AMD’s color capabilities and making them available in userspace.

November 06, 2023

 If you remember the last update two weeks ago, I got MobileNetV1 working with good performance, and I was planning to move to upstreaming my changes to the Linux kernel and Mesa.

One of the kernel patches is now queued for the 6.7 release of the Linux kernel, and the other one has just been resent for reviews.

Regarding Mesa, I have made several cleanups and have started getting great review comments from Christian Gmeiner.

While waiting for feedback, I have started work on using the TP cores for tensor manipulation, which should be many times faster than the naive code I was running on the CPU for this.

Got some jobs producing the correct results, but I'm facing a problem with the GPU hanging right afterwards. I have already made a pass over the whole set of data that is sent to the HW (unit configuration, command stream and registers), but haven't found the problem yet. Next, I will improve the tooling around this and get a better view of the differences.

I hacked Mesa to use the out-of-tree driver and my code works that way, so it has to be something in the kernel driver.

During the next weeks I will keep incorporating feedback and see how I can fix the GPU hang on TP jobs.


November 05, 2023

Linus has pulled the initial GSP firmware support for nouveau. This is just the first set of work to use the new GSP firmware and there are likely many challenges and improvements ahead.

To get this working you need to install the firmware which hasn't landed in linux-firmware yet.

For Fedora this copr has the firmware in the necessary places:

https://copr.fedorainfracloud.org/coprs/airlied/nouveau-gsp/build/6593115/ 

Hopefully we can upstream that in next week or so.

If you have an ADA based GPU then it should just try and work out of the box, if you have Turing or Ampere you currently need to pass nouveau.config=NvGspRm=1 on the kernel command line to attempt to use GSP.
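
If you want to double check that the option actually made it onto the command line, a few lines of Python are enough. This is just a sketch: the helper name is made up, and it assumes that several nouveau.config options, if present, are separated by commas.

```python
def gsp_requested(cmdline):
    """Return True if the kernel command line asks nouveau to use GSP."""
    prefix = "nouveau.config="
    for token in cmdline.split():
        if token.startswith(prefix):
            # Assumption: multiple config options are comma-separated.
            if "NvGspRm=1" in token[len(prefix):].split(","):
                return True
    return False
```

On a running system you could feed it the contents of /proc/cmdline, for example gsp_requested(open("/proc/cmdline").read()).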

Going forward, I've got a few fixes and stabilization bits to land, which we will concentrate on for 6.7. After that, we have to work out how to keep it up to date, how to support new hardware and how to add new features.


November 03, 2023

This is the second part of the Xwayland rootful post, the first part is there

Using Xwayland rootful to run a full X11 desktop

Xwayland rootful can run more than just a window manager, it can as well run an entire X11 desktop, for example with Xfce:

$ Xwayland -geometry 1024x768 -decorate :12 &
$ DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Xfce running on Xwayland rootful in GNOME Shell on Wayland


Unfortunately, not all the keyboard shortcuts within the nested X11 session actually work, because some of those (such as Alt-Tab for example) get processed by the Wayland compositor directly, instead of being forwarded to the nested environment.

This however isn't a problem specific to Wayland or Xwayland, an X11 window manager running in Xnest or Xephyr will have the same issues with keyboard shortcuts. To avoid that, Xephyr is able to „grab“ the keyboard and pointer so that all input events end up in the nested X11 session and do not get processed by the parent session.

Xwayland 23.1 has a similar functionality using the Wayland pointer locking & confinement protocol and the keyboard shortcuts inhibitor protocol.

So if your favorite Wayland compositor supports these protocols (if in doubt, you can check that it is the case using „wayland-info“), you can use the „-host-grab“ option in Xwayland rootful:

$ Xwayland -geometry 1024x768 -decorate -host-grab :12 &
$ DISPLAY=:12 SESSION_MANAGER= GDK_BACKEND=x11 dbus-run-session startxfce4

Pressing the Control and Shift keys simultaneously will release the keyboard and pointer (just like with Xephyr actually).
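
The protocol check mentioned above can be scripted. The sketch below looks for the relevant globals in the text printed by „wayland-info“. The interface names are my assumption of how the pointer constraints and keyboard shortcuts inhibitor protocols show up in that output, and the helper name is hypothetical.

```python
# Assumed global names for the two protocols named in the text.
REQUIRED_GLOBALS = (
    "zwp_pointer_constraints_v1",
    "zwp_keyboard_shortcuts_inhibit_manager_v1",
)


def missing_host_grab_protocols(wayland_info_output):
    """Return the required globals that do not appear in the given output."""
    return [name for name in REQUIRED_GLOBALS if name not in wayland_info_output]
```

You would typically pass it the output of subprocess.run(["wayland-info"], capture_output=True, text=True).stdout; an empty list means both globals were found.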

Using Xwayland rootful to run a single X11 application

In some cases, it might be desirable to run a single X11 application isolated from the rest of the X11 clients, on its own X11 server.

On such a setup, one could run a single X11 client either maximized or fullscreen within Xwayland rootful.

Since Xwayland 23.2 allows interactively resizing the root window, users can move and resize that window at will.

But for that to work, we need a simple X11 window manager that could resize the X11 client window along with the root window, using XRANDR notifications, such as the matchbox window manager for example.

$ Xwayland -geometry 1024x768 -decorate :12 &
$ matchbox-window-manager -display :12 &
$ GDK_BACKEND=x11 midori --display=:12

When the Xwayland rootful window is resized, corresponding XRANDR events are emitted, notifying the X11 window manager which in turn resizes the client window.

Using Xwayland rootful fullscreen

For years now, Xwayland rootless had support for the viewport Wayland protocol, to emulate XRandR for legacy games thanks to the work from Hans De Goede.

So the idea is to add a fullscreen mode to Xwayland rootful and take advantage of the Wayland viewports support to emulate resolution changes.

This is exactly what the „-fullscreen“ command line option does: it starts Xwayland rootful in fullscreen mode using the xdg_toplevel Wayland protocol and uses the existing viewport support to scale the window and to match the actual display physical resolution.

The emulated resolution is not even limited by the physical resolution, it's possible to use XRANDR to select an emulated resolution much higher than the actual monitor's resolution, quite handy to test X11 applications on high resolution without having to purchase expensive monitors!

$ Xwayland -fullscreen :12 &
$ matchbox-window-manager -display :12 &
$ xterm -display :12 &
$ xrandr -s 5120x2880 -display :12
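
If you start these nested servers from a script, the invocations shown in this post can be assembled with a small helper. This Python sketch only uses the options that appear above (-geometry, -decorate, -host-grab and -fullscreen); the function name is made up and it does not check whether a combination of options makes sense.

```python
def xwayland_rootful_argv(display, geometry=None, decorate=False,
                          host_grab=False, fullscreen=False):
    """Build the argument list for one of the Xwayland rootful invocations."""
    argv = ["Xwayland"]
    if fullscreen:
        argv.append("-fullscreen")
    if geometry is not None:
        argv += ["-geometry", geometry]
    if decorate:
        argv.append("-decorate")
    if host_grab:
        argv.append("-host-grab")
    argv.append(display)
    return argv
```

The resulting list can be passed directly to subprocess.Popen, after which the X11 clients are started with DISPLAY set to the same display name.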

Are we done yet?

Well, there's still one thing Xwayland is not handling well, it's HiDPI and fractional scaling.

With rootless Xwayland (as on a typical Wayland desktop session), all X11 clients share the same Xwayland server, and can span across different Wayland outputs of different scales.

Even though theoretically each Wayland surface associated with each X11 window could have a different scale factor set by Xwayland, all X11 clients on the same Xserver share the same coordinate space, so in practice different X11 windows cannot have different scale factors applied.

That's the reason why all the existing merge requests to add support for HiDPI to Xwayland set the same scale to all X11 surfaces. But that means that the rendered surface could end up being way too small depending on the actual scale the window is placed on, on a mixed-DPI multi-monitor setup (I already shared my views of the problem in this issue upstream).

But such limitation does not apply to rootful Xwayland, considering that all the X11 clients running on a rootful Xwayland actually belong to and remain within the same visible root window. They are part of the same visual entity and move all together along with the Xwayland rootful window.

So we could possibly add support for HiDPI (and hence achieve fractional scaling without blurred fonts) to rootful Xwayland. The idea is that Xwayland would set the surface scale to match the scale of the output it's placed on, and automatically resize its root window according to the scale, whenever that changes or when the rootful Xwayland window is moved from one monitor to another.

So for example, when Xwayland rootful with a size of 640×480 is moved from an output with scale 1 to an output with scale 2, the size of the root window (hence the Xwayland rootful window) would be automatically changed to 1280×960, along with the corresponding XRANDR notifications so that an X11 window manager running nested can adjust the X11 clients size and positions.
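
The arithmetic behind that example is simple enough to write down. The Python sketch below computes the new root window size when the output scale changes; rounding to the nearest integer for fractional scales is my assumption, not necessarily what the merge request does, and the function name is hypothetical.

```python
def scaled_root_size(width, height, old_scale, new_scale):
    """Return the root window size after moving between output scales."""
    factor = new_scale / old_scale
    # Assumption: fractional results are rounded to the nearest pixel.
    return round(width * factor), round(height * factor)
```

With the numbers from the example, scaled_root_size(640, 480, 1, 2) gives (1280, 960), and a 150% scale would turn the same window into 960×720.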

And if we want a way to communicate that to the X11 clients running within Xwayland rootful, we can use an X11 property on the root window that reflects the actual scale factor being applied. An X11 client could either use that property directly, or more likely, a simple dedicated daemon could adjust the scaling factor of the various X11 toolkits depending on the value set for Wayland scaling.

That's what that proposed merge request upstream does.

gnome-calculator running on Xwayland rootful with 150% fractional scaling

Of course, at the time of writing, this is just a merge request I just posted upstream, and there is no promise that it will be accepted eventually. We'll see how that goes, but if that could find its way to Xwayland upstream, it would be part of the next major release of Xwayland some time next year.

October 30, 2023

I was at XDC 2023 in A Coruña a few days ago where I had the opportunity to talk about some of the work we have been doing on the Raspberry Pi driver stack together with my colleagues Juan Suárez and Maíra Canal. We talked about Raspberry Pi 5, CPU job handling in the Vulkan driver, OpenGL 3.1 support and how we are exposing GPU stats to user space. If you missed it here is the link to Youtube.

Big thanks to Igalia for organizing it, to all the sponsors, and especially to Samuel and Chema for all the work they put into making this happen.

October 27, 2023

🪑?

October 26, 2023

And Now For Something Slightly More Technical

It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here’s one last big technical post about something I hate.

Swapchain readback.

So Easy Even You Could Accidentally Do It

I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.

I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:

  • draw some stuff
  • swapbuffers
  • blitframebuffer

And this happened on every single frame (???).

In Zink Terms…

This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.

Vulkan doesn’t allow readback from swapchains. By this, I mean:

  • swapchain images must be acquired before they can be accessed for any purpose
  • there is no method to explicitly reacquire a specific swapchain image
  • there is no guarantee that swapchain images are unchanged after present

Combined, once you have presented a swapchain image you’re screwed.

…According to the spec, that is. In the real world, things work differently.

Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.
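
Here is a toy Python model of that present/acquire loop. It is not Vulkan code: the swapchain is modelled as a simple FIFO of image indices, which is an assumption about how implementations tend to behave in practice rather than anything the spec promises, and all names are invented for the example.

```python
from collections import deque


class ToySwapchain:
    """A swapchain modelled as a FIFO queue of free image indices."""

    def __init__(self, num_images):
        self.free = deque(range(num_images))

    def acquire(self):
        return self.free.popleft()

    def present(self, image):
        self.free.append(image)


def reacquire_last_presented(swapchain, last_presented):
    """Spam acquire/present until the last-presented image comes back."""
    acquires = 0
    while True:
        image = swapchain.acquire()
        acquires += 1
        if image == last_presented:
            return image, acquires
        swapchain.present(image)
```

With three images, getting back the image that was just presented costs three acquires in this model, each of which stands for a round trip through the presentation engine in real life.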

P E R F

This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which it does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…

Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.

This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.
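
In very rough terms, the idea looks like the Python toy below. It is not zink's code: the detection threshold, the class and the method names are all hypothetical, and a frame is just a bytes object here.

```python
class ReadbackTracker:
    """Toy model: switch to keeping a staging copy once readback repeats."""

    def __init__(self, threshold=3):  # hypothetical detection threshold
        self.threshold = threshold
        self.consecutive_readbacks = 0
        self.readback_mode = False
        self.staging = None

    def on_present(self, frame_pixels, was_read_back):
        """Call once per frame; was_read_back says if the app read the last frame."""
        if was_read_back:
            self.consecutive_readbacks += 1
        else:
            self.consecutive_readbacks = 0
        if self.consecutive_readbacks >= self.threshold:
            self.readback_mode = True
        if self.readback_mode:
            # The extra full-frame copy that makes sync-free readback possible.
            self.staging = bytes(frame_pixels)

    def read_back(self):
        """Return the staging copy, or None if readback mode is not active."""
        return self.staging
```

The trade-off is visible in on_present: once the mode is active, every frame pays for one copy, but reading back no longer needs the present/acquire dance.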

The functionality is already merged for the upcoming 23.3 release.

Footgun as hard as you want.

October 25, 2023

More Milestones

As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.

This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.

Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.

October 24, 2023

Hi all, long time no see! It’s been more than two months since the last status update. My excuse for this silence is two-fold: I was on leave for 5 weeks, and then X.Org Developer’s Conference happened. During my time off, I’ve traveled in Korea and Japan. I will be blunt: these last two months have been fantastic! And to be honest, that’s a huge understatement.

Busan view from Jangsan

East gate

After my trip in Asia, I went to a 2-day Valve hackfest in Igalia’s headquarters. I met other Valve contractors there, and we discussed various topics such as color management, variable refresh rate, flicker-free startup, and more.

At XDC, there were lots of interesting talks and workshops: HDR by Joshua and Melissa, NVK by Faith, Asahi by Alyssa et al, wlroots frame scheduling by Rose (my GSoC student), CI by Martin, VKMS by Maíra, Wine Wayland by Alexandros, Wine X11 by Arek, and many more! Everything should be available online if you haven’t watched live. That said, as usual, the part I enjoyed the most is the so-called hallway track. It’s great to have free-form discussions with fellow graphics developers, it results in a pretty different train of thought than the usual focused discussions we have online.

Apart from these events, I’ve found some time to do a bit of actual work, too. I’ve re-spun an old patch I wrote to introduce a new CLOSEFB IOCTL, to allow a DRM master to leave a framebuffer on-screen when quitting so that the next DRM master can take over without a black screen in-between. This time I also included a user-space patch and an IGT test (both requirements for new kernel uAPI). I sent (and merged) another kernel patch to fix black screens in some situations when unplugging USB-C docks.

On the Wayland side, I continued working on explicit synchronization, updating the protocol and submitting a gamescope patch. Joshua has been working on a Mesa patch, so all of the pieces are coming together now. On the SourceHut side, I’ve sent a patch to add HTTP/2 support to pages.sr.ht. It’s been merged and deployed, enjoy! The NPotTM is libicc, a small library to parse ICC profile files. Unlike LittleCMS, it provides lower-level access to the ICC structure and the exact color transformation operations.

That’s all for now, see you next month!

There is an issue where the rpmfusion-packaged IPU6 camera stack for Fedora stops working on many Dell laptop models after upgrading to a 6.5.y kernel.

This is caused by a new mainline ov01a10 sensor driver which takes precedence over the akmod ov01a10 driver but lacks VSC integration.

This can be worked around by running the following command:
sudo rm /lib/modules/$(uname -r)/kernel/drivers/media/i2c/ov01a10.ko.xz; sudo depmod -a

After the rm + depmod run:
sudo rmmod ov01a10; sudo modprobe ov01a10

Or reboot. After this your camera will hopefully work again.

I have submitted a pull-request to disable the mainline kernel's non working ov01a10 driver, so after the next Fedora kernel update this workaround should no longer be necessary.

Your Bug Has Already Been Solved

After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.

Some of those crashes, however, are not my bugs. They’re system bugs.

In particular, any of you still using Xorg instead of Wayland will want to create this file:

$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
	Option "Debug" "dmabuf_capable"
EndSection

This makes your xserver dmabuf-capable, which will be more successful when running things with zink.

Another problem you’re likely to have is this console error:

DRI3 not available
failed to load driver: zink

Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.

Delete this package.

Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.

If you’re still having problems after checking for both of these issues, try turning your computer on.

October 23, 2023

Progress

Since the last update I finally got the whole of MobileNetv1 running at full-accuracy on the NPU with Mesa: 
tomeu@arm-64:~/mesa$ python3.10 classification.py -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from libteflon.so with args: {}
Processing the input took 18 ms.
Running the NN job took 13 ms.
Processing the output took 1 ms.
0.866667: military uniform
0.031373: Windsor tie
0.015686: mortarboard
0.007843: bow tie
0.007843: academic gown
time: 33.094ms
That takes us to a performance level around 3 times faster than running the same inference on the CPUs on the A311D SoC.

Most of the time (18 ms.) is spent in my naive manipulation of the input tensor, transposing and reshuffling it to match what the HW expects. Once we learn to do these operations on the 4 tensor manipulation cores, this time should be brought close to zero.

The 13 ms. that the convolutions take in the NPU is still noticeably higher than the 8 ms. that the blob achieves, but the optimizations mentioned in previous updates in this blog should bring us pretty close.
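
To put those numbers side by side, here is a tiny Python calculation using only the stage timings printed above (the three stages add up to 32 ms; the reported 33.094 ms total includes some extra overhead). The projected figures are a best-case guess that assumes the input processing drops to zero and the inference matches the blob's 8 ms.

```python
def stage_shares(stages_ms):
    """Return each stage's fraction of the summed stage time."""
    total = sum(stages_ms.values())
    return {name: ms / total for name, ms in stages_ms.items()}


# Stage timings from the log above, in milliseconds.
measured = {"input": 18, "inference": 13, "output": 1}

# Best case assumed in the text: input close to zero, inference at 8 ms.
projected = dict(measured, input=0, inference=8)
```

The input manipulation accounts for a bit more than half of the measured stage time, and the best-case projection lands at 9 ms for the three stages.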
 

Next steps

Now that we have something that people can use in their products, I will switch to upstreaming mode.

I want to do a few cleanups to the Mesa code and then I will ask for people to review and ack so it can be merged. In the meantime, the draft merge request can be found here.

I would also like to have a CI job running to make sure it doesn't regress. But given that we don't use NIR as of yet and the dependencies with the rest of Mesa are minimal, there is probably little need as long as I'm the only person contributing to the code.


Almost That Time Again

As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of shitposting highly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.

This is Day 1.

Zink: No Longer A Hacky Workaround Driver

2023 has seen great strides in the zink ecosystem:

  • Some games, most notably my favorite game of all time X-Plane, are now shipping zink in order to have a consistent GL experience across platforms
  • Zink has reached official GL 4.6 conformance on Imagination GPUs and will be shipping as their GL implementation
  • Zink can now run display servers for both X and Wayland, enabling full systems to exist without a native GL implementation

And there’s plenty more, of course, but throughout all this progress has been one very minor, very annoying wrinkle.

MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.
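For anyone still on an older Mesa, the override can be applied per process rather than session-wide. A minimal Python sketch, assuming a hypothetical helper name `zink_env`; only the variable name and its value come from the text above:

```python
import os


def zink_env(base=None):
    """Return a copy of the environment that asks the Mesa loader for zink.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["MESA_LOADER_DRIVER_OVERRIDE"] = "zink"
    return env


# Typical use: subprocess.run(["some-gl-app"], env=zink_env())
# ("some-gl-app" is a placeholder, not a real program name).
```

Keeping the override scoped to one child process avoids accidentally forcing zink on every GL client in the session.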

Or Does It?

Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.

Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.

The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.

I can’t imagine anyone will need it, but remember that issues can be reported here.

October 20, 2023

A bit of background

Xwayland is intended as a compatibility layer, to allow legacy X11 applications to continue to work in a Wayland environment.

Most Wayland compositors run Xwayland „rootless“ (using the command line option „-rootless“ when spawning Xwayland) so that X11 clients can integrate seamlessly with the other Wayland native clients, the Wayland compositor taking care of stacking the various windows (or surfaces) regardless of the client being X11 or Wayland native.

That actually works very well, so well that in many cases users do not even realize that any particular client is still running on X11, using Xwayland.

For that to work, the Wayland compositor needs to integrate a fully functional X11 window manager.

Sometimes, however, it is useful to use a separate X11 server to run X11 applications with another X11 window manager or even a full X11 environment.

Nested X11 servers

With X11, it is possible to run a nested X11 server such as Xnest or Xephyr, and run a full X11 environment within those nested X servers.

That can be useful for a number of reasons, like connecting to a remote legacy Unix server using XDMCP (not that I would recommend that anyway!), or for testing a particular X11 application with different window managers, or even because a particular X11 application is certified only with a specific window manager. The possibilities are endless.

$ Xephyr -retro -screen 1024x768 :12

Xephyr running the Motif window manager on a GNOME Shell Wayland session

But Xnest or Xephyr are X11 clients themselves, meaning that they run on top of Xwayland when running on a Wayland compositor. That's a bit of a waste, using two X11 servers on top of a Wayland compositor.

Besides, with X.org development winding down, downstream maintainers and packagers may want to reduce the number of X11 servers they ship and have to maintain in the future.

What's wrong with Xwayland rootful?

Right, so if Xwayland already runs rootful by default, why not just use that instead of Xnest or Xephyr?

Well, up until Xwayland 23.1, Xwayland rootful would take its screen configuration from the Wayland compositor itself (using the wl_output or xdg-output Wayland protocols), meaning that when running rootful, Xwayland would map a surface the size of all the monitors, and the user would have no way to easily move or resize it.

That's far from being practical, especially when using a multi-monitor setup!

Making Xwayland rootful (more) usable

So the first step to help make Xwayland rootful suitable as a nested X11 server is to provide a command line option to specify the desired size of the Xwayland window.

That's the „-geometry“ option introduced in Xwayland 23.1, which lets one specify the desired size of the Xwayland rootful window:

$ Xwayland -geometry 1024x768 :12

That will give you a black window of the specified size. If you want more of a „retro“ look, with the classic stipple and the X cursor, you can use:

$ Xwayland -geometry 1024x768 -retro :12

Still, the Xwayland window is missing a title bar that would allow for moving the window around.

This is because Wayland does not decorate its surfaces; it is left to the Wayland clients themselves to add window decorations (also known as client side decorations, or CSD for short).

This however would add a lot of complexity to Xwayland (which is primarily an X server, not a full-fledged Wayland application). Thankfully, there is libdecor which can shield Xwayland from that complexity and provide window decorations for us.

So if libdecor is installed on the system and Xwayland is built with libdecor enabled (this is an optional dependency though), then we can request that Xwayland uses decorations with the „-decorate“ command line option:

$ Xwayland -geometry 1024x768 -retro -decorate :12
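When launching such a nested server from a script, it can help to assemble the command line from the three options discussed so far. This is a small Python sketch; the function name `xwayland_argv` and its parameters are made up for illustration, and only the `-geometry`, `-retro` and `-decorate` options come from the text above:

```python
def xwayland_argv(display, width, height, retro=False, decorate=False):
    """Build the argument list for a rootful Xwayland window."""
    argv = ["Xwayland", "-geometry", f"{width}x{height}"]
    if retro:
        argv.append("-retro")      # classic stipple and X cursor
    if decorate:
        argv.append("-decorate")   # needs an Xwayland built with libdecor
    argv.append(f":{display}")
    return argv


print(" ".join(xwayland_argv(12, 1024, 768, retro=True, decorate=True)))
```

The resulting list can be handed to `subprocess.Popen` as-is, which avoids any shell quoting concerns.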


Now we can have fun running some legacy X11 applications on that Xwayland rootful server:

$ xterm -display :12 &
$ twm -display :12 &
$ xsetroot -solid dodgerblue -display :12


We can even use „xrandr“ to query the size of the Xwayland window and resize it:


New with Xwayland 23.2, the Xwayland window is also resizable interactively and the resulting display size is available in XRandR, creating an XRandR configuration to match the actual window size set interactively by the user:



October 18, 2023

This is my first blog post, ever!

I'm afraid there isn't much yet, but my intention is to post things related to Xwayland and various other projects I contribute to.

October 12, 2023

EBUSY

As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.

But there have been updates, and I’m gonna round ‘em all up.

R A Y T R A C W T F

Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.

He’s now implemented raytracing for Lavapipe.

It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.

CLosure

I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.

While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.

Fixups

A number of longstanding bugs have recently been fixed.

Wolfenstein Face

Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:

wolf-face.png

Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.

Apitrace: The Final Frontier

Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.

  • A number of games could be traced, but then replaying those traces would crash at a certain point. This is now fixed, enabling better bug reporting for a large number of AAA games from the last decade.
  • Another set of games using the id Engine could record traces, but then replaying them would fail to render correctly:

wolf-trace.png

This affects (at least) Wolfenstein: The Old Blood and DOOM2016, but the problem has been identified, and a fix is on the way.

Zink: Exploring New Display Systems

After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.

The Real Post

Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.

During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.

While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.

So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.

The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:

  • zink receives a (transfer) command sequence which is impossible to reorder and must split the renderpass to execute copies/blits
  • the app randomly flushes during rendering
  • the GL frontend hits a TC synchronization point and halts the recording thread to wait for the driver thread to finish execution

The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.

Texture uploading.

Texture Uploads: How Do They Work?

There are (currently) three methods by which TC can perform texture uploads:

  • for small uploads, the data is enqueued and passed asynchronously to the driver thread
  • for larger uploads:
    • if renderpass tracking is enabled and a renderpass is active, the upload will be sequenced into N strided uploads and passed asynchronously to the driver thread to avoid splitting renderpasses
    • otherwise TC syncs the driver thread and performs the upload directly
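The decision tree above can be written down as a small Python sketch. This is an illustration only, not the Threaded Context code: the enum, the function name and especially the `SMALL_UPLOAD_BYTES` cutoff are assumptions, since the post does not say where "small" ends.

```python
from enum import Enum


class UploadPath(Enum):
    ASYNC_SMALL = "enqueue the data, upload asynchronously"
    ASYNC_SEQUENCED = "N strided uploads, asynchronous"
    SYNC_DIRECT = "sync the driver thread, upload directly"


# Hypothetical cutoff; the real threshold is not stated in the post.
SMALL_UPLOAD_BYTES = 4096


def choose_upload_path(size_bytes, renderpass_tracking, renderpass_active):
    """Pick one of the three upload methods listed above."""
    if size_bytes <= SMALL_UPLOAD_BYTES:
        return UploadPath.ASYNC_SMALL
    if renderpass_tracking and renderpass_active:
        # Avoid splitting the renderpass at the cost of extra copies.
        return UploadPath.ASYNC_SEQUENCED
    return UploadPath.SYNC_DIRECT
```

Only the last branch stalls the GL frontend, which is why it is the one worth eliminating.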

Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.
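To make the "N is either 1 or $height" rule concrete, here is a Python sketch of how such a copy plan could be computed. The function name and the tuple layout are hypothetical, and the strides in the example assume a 4-bytes-per-pixel format; none of this is taken from the actual implementation.

```python
def plan_copies(height, src_stride, image_stride):
    """Return (src_offset, dst_row, row_count) tuples for buffer-to-image copies.

    If the source rows are laid out exactly like the image rows, one copy
    covers everything; otherwise each row needs its own copy.
    """
    if src_stride == image_stride:
        return [(0, 0, height)]
    return [(row * src_stride, row, 1) for row in range(height)]


# 4096x4096 atlas, 4 bytes per pixel (assumed): the image stride is 16384 bytes.
matching = plan_copies(4096, 16384, 16384)
padded = plan_copies(4096, 16640, 16384)   # source rows carry extra padding
print(len(matching), len(padded))
```

With mismatched strides the plan grows to one copy per row, which is the 4096-copy case mentioned above.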

You can see where this is going.

TC Execution: Define Optimal

Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:

ideal.png

Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:

  • the sequenced upload case will have more work, so the driver thread will run a bit longer than it otherwise would, resulting in the GL frontend waiting a bit longer than it otherwise would for completion

copies.png

  • the sync upload case creates a bubble in TC execution

sync.png

Solve For P

To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.

Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:

  • enqueued in a TC batch for execution
  • enqueued/active in a GPU submission

On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.

To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • delete staging image

For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:

  • create staging image
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame
  • upload texture data to staging image
  • draw to scene while sampling staging image
  • cache staging image for reuse
  • render frame

For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.

The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.
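The batch-ID idea can be sketched in a few lines of Python. All class and method names here are invented for illustration, and the sketch collapses "enqueued in TC" and "busy on the GPU" into a single notion of a completed batch, which the real code does not do.

```python
class Resource:
    def __init__(self):
        self.last_batch_id = None  # None: never referenced by any batch


class BatchTracker:
    def __init__(self):
        self.current_batch_id = 1    # batch currently being recorded
        self.completed_batch_id = 0  # newest batch known to have finished

    def mark_used(self, res):
        """Record that the current batch references this resource."""
        res.last_batch_id = self.current_batch_id

    def flush(self):
        """Close the current batch and start recording a new one."""
        self.current_batch_id += 1

    def complete_through(self, batch_id):
        """Note that every batch up to and including batch_id has finished."""
        self.completed_batch_id = max(self.completed_batch_id, batch_id)

    def is_idle(self, res):
        """True if an unsynchronized upload to res would be safe."""
        return res.last_batch_id is None or res.last_batch_id <= self.completed_batch_id
```

A cached staging image becomes idle again once the batch that last sampled it has completed, which covers the reuse pattern that a plain "seen" flag cannot.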

For it to really work to its fullest potential in zink, unfortunately, requires VK_EXT_host_image_copy to avoid further staging copies, and nobody implements this yet in mesa main (except Lavapipe, though also there’s this ANV MR). But someday more drivers will support this, and then it’ll be great.

As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.

The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.

October 10, 2023

At the moment I am hard at work putting together the final bits for the AppStream 1.0 release (hopefully to be released this month). The new release comes with many new features, an improved developer API and removal of most deprecated things (so it carefully breaks compatibility with very old data and the previous C API). One of the tasks for the upcoming 1.0 release was #481 asking about a formal way to distinguish Linux phone applications from desktop applications.

AppStream infamously does not support any “is-for-phone” label for software components; instead, the decision whether something is compatible with a device is based on the device’s capabilities and the component’s requirements. This allows for truly adaptive applications to describe their requirements correctly, and does not lock us into “form factors” going into the future, as there are many and the feature range between a phone, a tablet and a tiny laptop is quite fluid.

Of course the “match to current device capabilities” check does not work if you are a website ranking phone compatibility. It also does not really work if you are a developer and want to know which devices your component / application will actually be considered compatible with. One goal for AppStream 1.0 is to have its library provide more complete building blocks to software centers. Instead of just a “here’s the data, interpret it according to the specification” API, libappstream now interprets the specification for the application and provides API to handle most common operations – like checking device compatibility. For developers, AppStream also now implements a few “virtual chassis configurations”, to roughly gauge which configurations a component may be compatible with.

To test the new code, I ran it against the large Debian and Flatpak repositories to check which applications are considered compatible with what chassis/device type already. The result was fairly disastrous, with many applications not specifying compatibility correctly (many do, but it’s by far not the norm!). Which brings me to the actual topic of this blog post: Very few seem to really know how to mark an application compatible with certain screen sizes and inputs! This is most certainly a matter of incomplete guides and a lack of good templates, so maybe this post can help with that a bit:

The ultimate cheat-sheet to mark your app “chassis-type” compatible

As a quick reminder, compatibility is indicated using AppStream’s relations system: A requires relation indicates that the system will not run at all or will run terribly if the requirement is not met. If the requirement is not met, it should not be installable on a system. A recommends relation means that it would be advantageous to have the recommended items, but it’s not essential to run the application (it may run with a degraded experience without the recommended things though). And a supports relation means a given interface/device/control/etc. is supported by this application, but the application may work completely fine without it.

I have a desktop-only application

A desktop-only application is characterized by needing a larger screen to fit the application, and requiring a physical keyboard and accurate mouse input. This type is assumed by default if no capabilities are set for an application, but it’s better to be explicit. This is the metadata you need:

<component type="desktop-application">
  <id>org.example.desktopapp</id>
  <name>DesktopApp</name>
  [...]
  <requires>
    <display_length>768</display_length>

    <control>keyboard</control>
    <control>pointing</control>
  </requires>
  [...]
</component>

With this requires relation, you require a small-desktop sized screen (at least 768 device-independent pixels (dp) on its smallest edge) and require a keyboard and mouse to be present / connectable. Of course, if your application needs more minimum space, adjust the requirement accordingly. Note that if the requirement is not met, your application may not be offered for installation.

Note: Device-independent / logical pixels

One logical pixel (= device independent pixel) roughly corresponds to the visual angle of one pixel on a device with a pixel density of 96 dpi (for historical X11 reasons) and a distance from the observer of about 52 cm, making the physical pixel about 0.26 mm in size. When using logical pixels as unit, they might not always map to exact physical lengths as their exact size is defined by the device providing the display. They do however accurately depict the maximum amount of pixels that can be drawn in the depicted direction on the device’s display space. AppStream always uses logical pixels when measuring lengths in pixels.
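The figures in this note can be reproduced with a few lines of arithmetic. The Python sketch below uses only the 96 dpi and 52 cm values from the text; the variable names are arbitrary.

```python
import math

DPI = 96                 # reference pixel density from the text
MM_PER_INCH = 25.4
VIEW_DISTANCE_MM = 520   # "about 52 cm"

# Physical edge length of one pixel at the reference density.
pixel_mm = MM_PER_INCH / DPI

# Visual angle subtended by that pixel at the reference distance.
angle_deg = math.degrees(2 * math.atan(pixel_mm / (2 * VIEW_DISTANCE_MM)))

print(f"pixel size: {pixel_mm:.3f} mm")
print(f"visual angle: {angle_deg:.4f} degrees")
```

The pixel size comes out at roughly a quarter of a millimetre, matching the "about 0.26 mm" quoted above.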

I have an application that works on mobile and on desktop / an adaptive app

Adaptive applications have fewer hard requirements, but a wide range of support for controls and screen sizes. For example, they support touch input, unlike desktop apps. An example MetaInfo snippet for this kind of app may look like this:

<component type="desktop-application">
  <id>org.example.adaptive_app</id>
  <name>AdaptiveApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <supports>
    <control>keyboard</control>
    <control>pointing</control>
    <control>touch</control>
  </supports>
  [...]
</component>

Unlike the pure desktop application, this adaptive application requires a much smaller lowest display edge length, and also supports touch input, in addition to keyboard and mouse/touchpad precision input.

I have a pure phone/tablet app

Making an application a pure phone application is tricky: We need to mark it as compatible with phones only, while not completely preventing its installation on non-phone devices (even though its UI is horrible, you may want to test the app, and software centers may allow its installation when requested explicitly even if they don’t show it by default). This is how to achieve that result:

<component type="desktop-application">
  <id>org.example.phoneapp</id>
  <name>PhoneApp</name>
  [...]

  <requires>
    <display_length>360</display_length>
  </requires>

  <recommends>
    <display_length compare="lt">1280</display_length>
    <control>touch</control>
  </recommends>
  [...]
</component>

We require a phone-sized display minimum edge size (adjust to a value that is fit for your app!), but then also recommend the screen to have a smaller edge size than a larger tablet/laptop, while also recommending touch input and not listing any support for keyboard and mouse.
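To see how these relations play out against a concrete device, here is a rough Python sketch that evaluates the `display_length` and `control` items of the phone example. It is not libappstream's algorithm: the `unmet` helper is hypothetical, it only understands these two item types, and it assumes that a `display_length` without a `compare` attribute means "at least" and that the length refers to the shortest edge.

```python
import xml.etree.ElementTree as ET

PHONE_APP = """
<component type="desktop-application">
  <id>org.example.phoneapp</id>
  <requires>
    <display_length>360</display_length>
  </requires>
  <recommends>
    <display_length compare="lt">1280</display_length>
    <control>touch</control>
  </recommends>
</component>
"""

COMPARE = {
    "ge": lambda have, want: have >= want,
    "gt": lambda have, want: have > want,
    "le": lambda have, want: have <= want,
    "lt": lambda have, want: have < want,
    "eq": lambda have, want: have == want,
    "ne": lambda have, want: have != want,
}


def unmet(component_xml, relation, shortest_edge, controls):
    """List the items of one relation kind that a device does not satisfy."""
    root = ET.fromstring(component_xml)
    problems = []
    for rel in root.findall(relation):
        for item in rel:
            if item.tag == "display_length":
                op = item.get("compare", "ge")  # assumed default: "at least"
                if not COMPARE[op](shortest_edge, int(item.text)):
                    problems.append(f"display_length {op} {item.text}")
            elif item.tag == "control" and item.text not in controls:
                problems.append(f"control {item.text}")
    return problems


# A made-up desktop: 1280 px shortest edge, keyboard and mouse, no touch.
print(unmet(PHONE_APP, "requires", 1280, {"keyboard", "pointing"}))
print(unmet(PHONE_APP, "recommends", 1280, {"keyboard", "pointing"}))
```

On the made-up desktop nothing required is missing, but both recommendations fail, which is the "installable but not recommended" outcome the snippet is designed to produce.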

Please note that this blog post is of course not a comprehensive guide, so if you want to dive deeper into what you can do with requires/recommends/suggests/supports, you may want to have a look at the relations tags described in the AppStream specification.

Validation

It is still easy to make mistakes with the system requirements metadata, which is why AppStream 1.0 will provide more commands to check MetaInfo files for system compatibility. Current pre-1.0 AppStream versions already have an is-satisfied command to check if the application is compatible with the currently running operating system:

:~$ appstreamcli is-satisfied ./org.example.adaptive_app.metainfo.xml
Relation check for: */*/*/org.example.adaptive_app/*

Requirements:
 • Unable to check display size: Can not read information without GUI toolkit access.
Recommendations:
 • No recommended items are set for this software.
Supported:
  Physical keyboard found.
  Pointing device (e.g. a mouse or touchpad) found.
 • This software supports touch input.

In addition to this command, AppStream 1.0 will introduce a new one as well: check-syscompat. This command will check the component against libappstream’s mock system configurations that define a “most common” (whatever that is at the time) configuration for a respective chassis type.

If you pass the --details flag, you can even get an explanation why the component was considered or not considered for a specific chassis type:

:~$ appstreamcli check-syscompat --details ./org.example.phoneapp.metainfo.xml
Chassis compatibility check for: */*/*/org.example.phoneapp/*

Desktop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Laptop:
  Incompatible
 • recommends: This software recommends a display with its shortest edge 
   being << 1280 px in size, but the display of this device has 1280 px.
 • recommends: This software recommends a touch input device.

Server:
  Incompatible
 • requires: This software needs a display for graphical content.
 • recommends: This software needs a display for graphical content.
 • recommends: This software recommends a touch input device.

Tablet:
  Compatible (100%)

Handset:
  Compatible (100%)

I hope this is helpful for people. Happy metadata writing! 😀

October 06, 2023

Progress

Last week I was a bit distracted with the trip to Paris for the Embedded Recipes conference, but later I found some time for hacking and got some interesting results out of it.

Refactored the Gallium front-end

As commented in the previous update, I had found some limits in my testing due to the naive way that the front-end was scheduling jobs to the Gallium hardware-dependent driver.

I got to basically rewrite it (and removed any C++ remnants, on the way) and moved to a model in which the drivers would compile the operation blocks that they support to a format that can be quickly sent to the hardware.

As a side effect, I got proper memory management of the workload which allowed me to expand the testing I can do in a reasonable amount of time.

Also took the chance to rewrite the higher level scheduling data structure so all jobs in the same model partition are sent to the hardware in a single batch, for decreased latency.

Unfortunately I didn't get to remove copies of input and output tensors because the TensorFlow Lite API for this (TfLiteAsyncKernel) is undocumented and far from trivial. They seem to just be adding stuff on top to abstract whatever the Android folks may end up wanting to do.

Got MobileNet V1 to run

As part of the refactoring from above, I got multiple operations in the same model to work, which got us to correctly running some inferences, even if at low accuracy rates:

by Julien Langlois CC BY-SA 3.0

tomeu@arm-64:~/mesa$ LD_PRELOAD=libtensorflow_lite.so python3.10 class_device.py -i hen.bmp -m mobilenet_v1_0.25_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from build/src/gallium/targets/teflon/libteflon.so with args: {}
tflite_plugin_create_delegate
Teflon delegate: loaded etnaviv driver
INFO: Initialized TensorFlow Lite runtime.
PrepareDelegate
VERBOSE: Replacing 27 out of 31 node(s) with delegate (Teflon Delegate) node, yielding 2 partitions for the whole graph.
0.960784: hen
0.015686: cock
0.007843: goose
0.003922: Pembroke
0.003922: Ibizan hound
time: 22.802ms
tflite_plugin_destroy_delegate

This matched bit by bit the output from the blob, even if I was doing some tensor operations by hand, on the CPU. That also causes it to run far too slowly. We should be able to get that down to around 5ms once we learn how to drive the TP units for tensor manipulation.
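As an aside on reading that log: the printed scores are consistent with 8-bit outputs divided by 255. The Python sketch below reproduces the first four lines under that assumption; the raw values are inferred from the log rather than read from the model, and the `score` helper is purely illustrative.

```python
SCALE = 1 / 255  # assumption inferred from the printed values


def score(raw):
    """Format an 8-bit class score the way the log above prints it."""
    return f"{raw * SCALE:.6f}"


for raw, label in [(245, "hen"), (4, "cock"), (2, "goose"), (1, "Pembroke")]:
    print(f"{score(raw)}: {label}")
```

This also explains why the two lowest entries share the same score: both sit at the smallest non-zero step an 8-bit output can express.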

Presented this work at Embedded Recipes 2023

Tired of only writing about all this in this blog, I took the chance given to me by Kevin Hilman to present it in front of a captive audience.


You can find the slides here, and listen to the talk at:



Next steps

The previous update went deeper into what is left to do in the medium term, so I will just mention what I plan to do in the immediate future:

  1. Get input and output channels working at the 512 level, so we can run a higher accuracy version of the MobileNet V1 network
  2. Learn to use the TP units to remove those costly transpositions and reshuffles in the CPU (at this point, we would have something useful to people in the field)
  3. Upstream changes to the Linux kernel
  4. Propose Teflon to the Mesa folks

September 26, 2023

Progress

With the kids back in school I have been able to work on the Vivante VIP NPU driver full-time during the two weeks after the last update, with quite some work coming out of the pipeline:

Found the problem with enabling the 8th NN core

Though I don't know exactly yet what the problem is, I found that by going back to a previous brute-force approach to powering up the NPU, the 8th core works just fine.

For now this unblocks the work and gets me closer to the initial goal of running a MobileNetv1 inference and seeing what the performance is like, so I'm leaving a proper fix for this for later.

I bet there's either a register that is being written in the wrong order, or a delay between register writes that is too short. Will have to delve into the power domain subsystem and/or the common clock framework in the Linux kernel to fix this one.

Added support for depthwise convolutions

MobileNetV1 introduced Separable Depthwise Convolutions (see the linked paper for an in-depth description), which are layers that contain a depthwise convolution to process each depth level separately, plus a pointwise convolution to rejoin them. This offers the same result with 23x fewer multiplications, so it's very attractive for mobile use-cases.

This hardware doesn't support depthwise convolutions directly, but we can lower them to regular convolutions after modifying the weight tensor to cover each IFM/depth separately.
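One generic way to do such a lowering is to expand the depthwise weights into a regular weight tensor that is zero wherever input and output channel differ. The Python sketch below checks the equivalence for a single output position; it is an illustration of the idea with made-up function names, not the driver's actual memory layout.

```python
def expand_depthwise(w):
    """Turn depthwise weights w[i][j][c] into regular weights W[i][j][c_in][c_out].

    Only the entries with c_in == c_out are non-zero, so each output
    channel still sees exactly one input channel.
    """
    k, channels = len(w), len(w[0][0])
    return [[[[w[i][j][c] if c == o else 0 for o in range(channels)]
              for c in range(channels)]
             for j in range(k)]
            for i in range(k)]


def depthwise_at(patch, w):
    """Depthwise convolution result for one k x k input patch."""
    k, channels = len(w), len(w[0][0])
    return [sum(patch[i][j][c] * w[i][j][c] for i in range(k) for j in range(k))
            for c in range(channels)]


def regular_at(patch, W):
    """Regular convolution result for the same patch."""
    k, c_in, c_out = len(W), len(W[0][0]), len(W[0][0][0])
    return [sum(patch[i][j][c] * W[i][j][c][o]
                for i in range(k) for j in range(k) for c in range(c_in))
            for o in range(c_out)]
```

The expanded tensor is mostly zeros, so this trades extra multiplications for compatibility with hardware that only understands regular convolutions.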

Added support for pointwise convolutions

For the second half of a Separable Depthwise Convolution, I just had to take into account that 1x1 kernels are packed in a different format in memory, as otherwise it would be very inefficient for each NN core to pull each 1-byte kernel separately from the memory bus.

Added support for unsigned weights

TensorFlow Lite has moved towards implementing a new quantization specification which gives preference to signed weights because of convenience, as symmetric quantization is simpler to implement. Unfortunately for us, our hardware works natively with unsigned weights so we would need to convert them if we were to use TFLite's new quantization.

But the models that Google themselves publish make use of the ancient tooling that still supports the old, unsigned quantization scheme, so I had to find a way of producing models with unsigned quantization for our test suite, to match what MobileNetV1 does.

That also implied moving to per-tensor quantization, instead of per-axis.
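For reference, the signed-to-unsigned conversion mentioned above is a simple shift when done per tensor: adding 128 to both the stored values and the zero point leaves every dequantized weight unchanged. The Python sketch below shows that generic trick with hypothetical helper names; whether the driver would do it exactly this way is an assumption.

```python
def signed_to_unsigned(weights_i8, zero_point_i8):
    """Shift int8 weights and their zero point into the uint8 range."""
    return [w + 128 for w in weights_i8], zero_point_i8 + 128


def dequantize(q, zero_point, scale):
    """Affine dequantization: real = scale * (q - zero_point)."""
    return [scale * (v - zero_point) for v in q]


signed = [-128, -3, 0, 5, 127]
unsigned, zp = signed_to_unsigned(signed, 0)
print(unsigned, zp)
```

Because the shift cancels out inside `q - zero_point`, the real-valued weights are identical before and after the conversion.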

Added support for higher IFMs and OFMs (up to 256 each)

In the previous update I explained how support for multiple input and output channels (or feature maps) was added, but I wasn't able to test with more than 7 output channels because the 8th NN core was MIA.

With that solved, I was able to see what would be needed for convolutions with higher channel counts, such as those that MobileNetV1 uses (32, 64, 128, 256, 512 and 1024).

Each level implied revisiting the tiled format in which weights and biases are laid out in memory, making it more and more complex.

I got to 256, with 512 and 1024 bringing more changes in the tiled format that I still need to reverse engineer.


Next steps

Model partition compilation and resource management

I'm facing problems with testing coverage as we support so many different parameters that need to be tested in combination, with an explosion in the number of individual tests. Because of the hacky current state of the TFLite delegate (and Gallium state tracker) I'm not able to run all the tests because I don't have proper resource management implemented and so we reach OOM before the end.

So my next task after I get back from Embedded Recipes will be to refactor the delegate implementation so we have a proper compilation of the model partitions. These will own the weight+bias buffers as well as the intermediate tensors, with each inference just feeding an input tensor to the partition and retrieving an output tensor at the end.

This will allow me to scale up the automated testing further, so I can keep adding new features with confidence, knowing that I'm not adding regressions.
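The ownership model described above could look roughly like the sketch below. It is plain Python rather than delegate code: the class name, the elementwise "layers" and the buffer layout are all assumptions made for illustration. The point is only that coefficients and intermediate buffers are allocated once at compile time and reused by every inference.

```python
class CompiledPartition:
    """Hypothetical stand-in for a compiled model partition.

    It owns the coefficients and the intermediate buffers, so an
    inference allocates no per-layer storage.
    """

    def __init__(self, layers, tensor_size):
        # Each layer is a (weight, bias) pair applied elementwise; this
        # stands in for the real weight+bias buffers.
        self.layers = list(layers)
        self.tensor_size = tensor_size
        # One intermediate buffer per layer, allocated once at compile time.
        self.buffers = [[0] * tensor_size for _ in self.layers]

    def invoke(self, input_tensor):
        """Feed one input tensor through the partition and return the output."""
        if len(input_tensor) != self.tensor_size:
            raise ValueError("unexpected input tensor size")
        current = input_tensor
        for (weight, bias), out in zip(self.layers, self.buffers):
            for i, value in enumerate(current):
                out[i] = value * weight + bias
            current = out
        # Copy the result out so the caller never aliases an owned buffer.
        return list(current)


partition = CompiledPartition([(2, 1), (3, 0)], tensor_size=4)
print(partition.invoke([1, 2, 3, 4]))
```

Because the buffers belong to the partition, repeated inferences reuse the same memory, and releasing the partition releases everything it allocated.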

Move development to Cottonwood A311D board

Da Xue of LibreComputer has got Etnaviv and Teflon working on the new boards that his company is releasing soon. One of them contains an A311D SoC, the same as the VIM3 I'm currently using for development. I will be initially targeting that one, and will later make sure that it also works on the Cottonwood boards that will have the S905D3 SoC, which has a VIP Pico instead of a VIP Nano.

Besides being in general a great FOSS champion and specifically being supportive of ML inference with open source, Da is directly sponsoring this work, so I look forward to meeting him in Paris this week and exchanging notes.

Bigger coefficient tensors

The last known features missing before MobileNetV1 can run are support for 512 and 1024 IFMs and OFMs.

Hopefully it will only require some further tweaking of the tiled memory representation of the coefficient buffer.

Medium term goals

I don't expect performance to be that great yet, so I plan on switching the focus to it after the above has been accomplished. I expect the features below to have the most impact on performance:
  1. Avoid copies in and out of the model partition, by mapping user buffers to the NPU
  2. Use the TP units for tensor manipulation (transposing, mostly)
  3. Properly configure the automatic caching of kernels and images in the internal on-chip SRAM
  4. Use the external SRAM for intermediate tensor data
  5. Chain all TP and NN jobs in a model partition in the same command stream
  6. Enable zero-run-length compression in the coefficient buffer
  7. Tune the tiling parameters for reduced memory bandwidth usage
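
To give an idea of what zero-run-length compression (item 6) buys, here is a generic sketch of the technique. The encoding below, pairs of "number of zeros skipped" and "next non-zero value", is an assumption chosen for readability and is not the bit-level format the hardware expects.

```python
def zrl_encode(values):
    """Encode a list as (zero_run, value) pairs.

    zero_run is the count of zeros preceding value. Trailing zeros are
    stored as a final (zero_run, None) pair.
    """
    encoded = []
    run = 0
    for value in values:
        if value == 0:
            run += 1
        else:
            encoded.append((run, value))
            run = 0
    if run:
        encoded.append((run, None))
    return encoded


def zrl_decode(encoded):
    """Expand (zero_run, value) pairs back into the original list."""
    values = []
    for run, value in encoded:
        values.extend([0] * run)
        if value is not None:
            values.append(value)
    return values


# Hypothetical sparse coefficients.
coefficients = [0, 0, 5, 0, 0, 0, -3, 7, 0, 0]
packed = zrl_encode(coefficients)
print(packed)
```

The sparser the coefficients, the fewer pairs are needed, which is where the reduction in memory bandwidth would come from.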