Kerfuffle with AMD's amdgpu-install tool

Tags: Linux, AMD, graphics, bug

I'm the owner of an AMD RX580 GPU. On Linux, it uses the open-source amdgpu driver and has been pretty nice for the occasional gaming session and playing with Blender, even on my aging desktop (originally built in 2014, starring an Intel i5-4570 CPU!). One area where my poor desktop shows its age is image processing. I sometimes edit raw pictures in the wonderful darktable, and some modules are very heavy on the CPU. The good news is that darktable comes with support for OpenCL. The bad news is that although my GPU supports OpenCL, it's quite complicated to get it working on Linux. AMD gave up on providing open-source OpenCL support for this generation of GPU (and older). Instead, users have to rely on the legacy, proprietary drivers provided by AMD.

Since it's possible to only install the OpenCL stack with the amdgpu-install tool, I figured I could give it a try.

Unfortunately for me, I timed it very badly: Ubuntu 22.04.3 had been released the day before, bringing a new version of the Linux kernel (6.2), so my computer's kernel had been upgraded accordingly. The latest version of the AMD proprietary drivers had been released on July 31st and only supported Ubuntu 22.04.2… So, of course, when I tried to install it, it failed miserably while trying to install its DKMS module.

I used the amdgpu-uninstall script to remove the packages installed by the AMD proprietary drivers, and called it a day. I could always revisit this once AMD released a version of their drivers compatible with Ubuntu 22.04.3.

But the next day…

The next day, when I booted my desktop, I was back in 1998. The login screen was using a 1024×768 resolution. Same issue once logged in, and there were no other available resolutions in sight in GNOME Settings.

I checked the content of the system journal for the last boot (journalctl -b0), and saw something strange:

(EE) open /dev/dri/card0: No such file or directory

(EE) stands for an error in X.org. /dev/dri/card0 is the device normally created by the amdgpu driver… why wasn't it there?!
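The journal for a whole boot can be long, so it helps to filter it down to the X.org error lines. Here is a minimal sketch, assuming X.org messages end up in the system journal as they did on my machine; xorg_errors is a hypothetical helper name of my own, not part of any tool:

```shell
# Hypothetical helper: keep only X.org error lines, i.e. those tagged "(EE)".
# It reads log text on stdin, so it works with any source of log lines.
xorg_errors() {
    # -F: treat "(EE)" as a fixed string, not as a regular expression
    grep -F '(EE)'
}

# Usage (current boot):
#   journalctl -b 0 --no-pager | xorg_errors
```

Note that X.org also prints a legend explaining its markers, so a line describing what (EE) means may show up in the output alongside the real errors.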

On Ubuntu, there is another interesting log file: /var/log/gpu-manager.log. It lists a bunch of things in a human-friendly format. A few strange things appeared:

Is amdgpu loaded? no
Is amdgpu blacklisted? yes
Is amdgpu versioned? no
Is amdgpu pro stack? no
(...)
Error: can't access /sys/bus/pci/devices/0000:01:00.0/driver
The device is not bound to any driver.
Error : Failed to open /dev/dri

radeon is the open-source driver used for pretty old AMD GPUs. "Recent" AMD GPUs (as in, less than 10 years old) are all compatible with the amdgpu driver. So why was amdgpu not loaded, and even blacklisted?!

Side note: I find the whole AMD naming circus extremely confusing. RX580, Polaris, Vega, GCN, RDNA… It's so confusing that Wikipedia has a whole article to track each generation of GPUs and all the codenames attached to them.

After some poking around and chatting, my colleague Daniel suggested having a look at the list of blacklisted modules in /etc/modprobe.d/. And, sure enough, there was a blacklist-amdgpu.conf containing the dreaded blacklist amdgpu!
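Rather than opening each file by hand, the search can be scripted. This is only a sketch: find_blacklist is a hypothetical helper name, and it assumes the blacklist entry lives under /etc/modprobe.d (modprobe reads a few other directories too, which can be passed as the second argument):

```shell
# Hypothetical helper: list active "blacklist <module>" lines in modprobe config.
# $1 = module name, $2 = config directory (defaults to /etc/modprobe.d).
find_blacklist() {
    fb_module=$1
    fb_dir=${2:-/etc/modprobe.d}
    # -r: search the directory recursively, -n: show line numbers,
    # -E: extended regular expressions.
    # Commented-out entries are skipped because the line must start
    # with "blacklist" (optionally preceded by whitespace).
    grep -rnE "^[[:space:]]*blacklist[[:space:]]+${fb_module}([[:space:]]|\$)" "$fb_dir"
}

# Usage:
#   find_blacklist amdgpu
```

Each match is printed as file:line:content, which points straight at the offending file.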

I deleted this file, rebooted, and my system booted into a GNOME session in full 4K glory!

Apparently, the cause of all this is a bug in the amdgpu-uninstall script…

Now, /var/log/gpu-manager.log is happier:

Is amdgpu loaded? yes
Is amdgpu blacklisted? no
(...)
Found "/dev/dri/card0", driven by "amdgpu"
output 0:
    card0-DP-2
Number of connected outputs for /dev/dri/card0: 1
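The same check can be done without gpu-manager by asking sysfs which driver is bound to the card. A small sketch, assuming the GPU sits at PCI address 0000:01:00.0 as in the log above (yours may differ); bound_driver is a hypothetical helper name:

```shell
# Hypothetical helper: print the kernel driver bound to a PCI device.
# $1 = sysfs device directory, e.g. /sys/bus/pci/devices/0000:01:00.0
bound_driver() {
    if [ -L "$1/driver" ]; then
        # "driver" is a symlink to the driver's sysfs directory;
        # the last path component of its target is the driver name.
        basename "$(readlink "$1/driver")"
    else
        # No symlink means the device is not bound to any driver.
        echo "none"
        return 1
    fi
}

# Usage:
#   bound_driver /sys/bus/pci/devices/0000:01:00.0
```

On a healthy setup like the one above this should print amdgpu; while the module was blacklisted, the driver link was missing, which matches the "not bound to any driver" message from the earlier log.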

So here we go. When facing this kind of weird issue on Linux, the first step is almost always to have a look at the system journal. And the second step is either to have a great colleague or to have good search-fu skills! 😆

Edit: I opened an issue in their public bugtracker (#2800).