I Have No Sound, and I Must Hear: A Story of a Kernel Regression

#linux #kernel #debugging #hardware

TL;DR: After a kernel update, my 15-year-old laptop lost sound: the driver logged CORB reset timeout with CORBRP = 65535 (0xFFFF). Reinstalling audio servers didn't help, and kernel parameters set in GRUB made no difference. The fix? Force a PCI rescan via sysfs and automate it with a systemd service. Here's the full debugging journey.

The Prologue

I use an old laptop, an eMachines E732. It's so ancient that it maxes out at 8GB of RAM. Yet it remains my reliable workhorse.

Everything was fine until a routine sudo pacman -Syu on CachyOS. Suddenly, the sound was gone. Reinstalling pulseaudio or pipewire did nothing. I hopped over to EndeavourOS, where it worked... until I switched to the linux-zen kernel. Then it died again.

Eventually, the sound vanished even on LTS kernels and LiveCDs. I almost gave up and bought a USB sound card, but then I realized: a workaround must exist.

The Diagnosis

Standard tools were useless. aplay -l was laconic:

aplay: device_list:274: no soundcards found...

However, inxi -Axxz confirmed the hardware was still there:

Audio:
  Device-1: Intel 5 Series/3400 Series High Definition Audio
  vendor: Acer Incorporated ALI driver: snd_hda_intel v: kernel
  bus-ID: 00:1b.0 chip-ID: 8086:3b56

And journalctl showed the first signs of trouble:

kernel: hdaudio hdaudioC0D0: no AFG or MFG node found
kernel: hdaudio hdaudioC0D1: no AFG or MFG node found

The system knew the card existed but couldn't talk to it. Here’s what didn't work:

User-space fixes: Reinstalling PipeWire or tweaking alsamixer. If ALSA can't see the card, software sliders are useless.
Module parameters: Adding model=auto or enable=1 to modprobe.d. These change driver behavior but can't fix PCI resource allocation errors.
Kernel parameters: pci=nocrs, pci=nomsi, pcie_aspm=off in GRUB. None of these crutches changed anything.
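
One thing worth ruling out with kernel parameters is whether they reached the kernel at all. The running kernel exposes its command line in /proc/cmdline, so a quick check could look like the sketch below. The helper name cmdline_has_param is made up for this example; it is not a standard tool.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole token in a
# kernel command line file (default: /proc/cmdline).
cmdline_has_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each whitespace-separated token on its own line, then look for
    # an exact, fixed-string, whole-line match.
    tr -s '[:space:]' '\n' < "$file" | grep -Fxq -- "$param"
}

# Report on the three parameters tried above (only where /proc/cmdline exists).
if [ -r /proc/cmdline ]; then
    for p in pci=nocrs pci=nomsi pcie_aspm=off; do
        if cmdline_has_param "$p"; then
            echo "$p: present"
        else
            echo "$p: missing"
        fi
    done
fi
```

A parameter that shows as missing most likely never made it from the bootloader config into the boot entry. One that shows as present was received and simply didn't help with this problem.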

Deep Dive into dmesg (The 0xFFFF Error)

The truth was buried in dmesg. When the snd_hda_intel driver tries to wake up, it hits a wall:

snd_hda_intel 0000:00:1b.0: CORB reset timeout#2, CORBRP = 65535
snd_hda_intel 0000:00:1b.0: no codecs initialized

What is CORB and why does it matter?

CORB (Command Outbound Ring Buffer) is the ring buffer used by the driver to send commands to the audio codec.

Timeout means the controller didn't respond when the driver tried to reset the CORB read pointer.
CORBRP = 65535 (0xFFFF) is the key: every bit of the 16-bit read pointer register reads as one. In the hardware world, "all ones" usually means the bus is hung or the device is physically inaccessible/unpowered.
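
As a small illustration of the all-ones check, the sketch below pulls the CORBRP value out of a log line and tests whether it equals 0xFFFF. The function names corbrp_from_line and is_all_ones_16 are invented for this example, and the log format is assumed to match the dmesg lines shown above.

```shell
# Illustrative helpers only; the names are hypothetical.

# Print the decimal CORBRP value from a kernel log line (empty if absent).
corbrp_from_line() {
    printf '%s\n' "$1" | sed -n 's/.*CORBRP = \([0-9][0-9]*\).*/\1/p'
}

# Succeed if the value is 65535, i.e. all 16 bits set (0xFFFF).
is_all_ones_16() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -eq 65535 ]
}

line='snd_hda_intel 0000:00:1b.0: CORB reset timeout#2, CORBRP = 65535'
value=$(corbrp_from_line "$line")
if is_all_ones_16 "$value"; then
    printf 'CORBRP = 0x%X: all ones, the register read is not trustworthy\n' "$value"
fi
```

The point of the check is that 0xFFFF is not a plausible ring buffer position; it is what a read tends to return when nothing answers on the other end.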

Trying to reload the module manually (modprobe -r / modprobe) just produced the same errors with different timestamps. The device was stuck.

Eureka: The Software "Hot-swap"

I noticed that during a cold boot, the kernel assigned the card an address in the lower memory range: BAR 0 [mem 0xd4400000-0xd4403fff 64bit].

I decided to try a "software hot-plug" via sysfs:

Remove the device from the PCI tree.
Force the kernel to rescan the bus.

# Drop the card from the system
echo 1 | sudo tee /sys/bus/pci/devices/0000:00:1b.0/remove
sleep 2
# Rescan the PCI bus
echo 1 | sudo tee /sys/bus/pci/rescan

It worked. The dmesg output changed. Upon rescan, the kernel assigned a new memory address (BAR 0):

pci 0000:00:1b.0: BAR 0 [mem 0x23c000000-0x23c003fff 64bit]: assigned
snd_hda_intel 0000:00:1b.0: bound 0000:00:02.0
input: HDA Intel MID Mic as /devices/pci0000:00/.../input24

Before: mem 0xd4400000 (below 4GB, where conflicts with the old BIOS's resource map are more likely).
After: mem 0x23c000000 (moved above the 4GB boundary).

The Realtek ALC272 codec finally responded.

State	BAR 0 Address	Driver Status	Result
Cold Boot	`0xd4400000`	`CORB reset timeout`	Silence
After Rescan	`0x23c000000`	`assigned / bound`	Sound!
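
To see which side of the 4GB boundary a BAR landed on without digging through dmesg, the comparison can be sketched in shell. The helper names addr_above_4g and bar0_start are hypothetical, and I'm assuming the usual sysfs layout in which the first line of the device's resource file describes BAR 0 as start, end, and flags in hex.

```shell
# Hypothetical helpers for checking where BAR 0 was placed.

# Succeed if a hex address such as 0x23c000000 is at or above 4 GiB.
addr_above_4g() {
    [ "$(( $1 ))" -ge "$(( 0x100000000 ))" ]
}

# Print the start address of BAR 0 from a sysfs "resource" file.
# Assumption: line 1 of that file is "start end flags" for BAR 0.
bar0_start() {
    awk 'NR == 1 { print $1 }' "$1"
}

# Classify the two addresses from the logs above.
for addr in 0xd4400000 0x23c000000; do
    if addr_above_4g "$addr"; then
        echo "$addr: above 4 GiB"
    else
        echo "$addr: below 4 GiB"
    fi
done
```

On the real machine, bar0_start /sys/bus/pci/devices/0000:00:1b.0/resource would feed the live value into the same check.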

Automation

I first tried a udev rule, but it failed due to a race condition: udev caught the "add" event and tried to "remove" the card while the kernel was still struggling to initialize it.

The stable solution is a systemd oneshot service: /etc/systemd/system/fix-audio.service

[Unit]
Description=Fix Audio PCI Rescan
After=multi-user.target

[Service]
Type=oneshot
ExecStartPre=/usr/bin/sh -c 'echo 1 > /sys/bus/pci/devices/0000:00:1b.0/remove'
ExecStart=/usr/bin/sh -c 'sleep 2 && echo 1 > /sys/bus/pci/rescan'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
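
A possible refinement (a sketch, not the setup described above): only replug the device when ALSA reports no sound cards, so the fix becomes a no-op on boots where audio came up by itself. The function names, the REPLUG_DELAY variable, and the optional path arguments are inventions for this example; the paths default to the real /sys and /proc/asound/cards.

```shell
# Hypothetical guard around the remove/rescan trick.

# Succeed if the ALSA card list (default: /proc/asound/cards) shows a card.
has_soundcard() {
    cards=${1:-/proc/asound/cards}
    [ -r "$cards" ] && ! grep -q 'no soundcards' "$cards"
}

# Remove DEVICE and rescan the bus under a sysfs root (default: /sys).
pci_replug() {
    device=$1
    sysfs=${2:-/sys}
    echo 1 > "$sysfs/bus/pci/devices/$device/remove" || return 1
    sleep "${REPLUG_DELAY:-2}"   # same 2-second pause as the unit above
    echo 1 > "$sysfs/bus/pci/rescan"
}

# Only touch the bus when sound is actually missing.
# Arguments: DEVICE [SYSFS_ROOT] [CARDS_FILE]
fix_audio_if_needed() {
    if has_soundcard "$3"; then
        echo "sound card present, nothing to do"
    else
        pci_replug "$1" "$2"
    fi
}

# Example call (needs root on a real system):
# fix_audio_if_needed 0000:00:1b.0
```

Saved as a script, this could replace the two inline sh -c commands in the unit. The extra path arguments exist mainly so the logic can be exercised against a scratch directory instead of the real sysfs.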

Conclusion

LTS doesn't guarantee the absence of regressions. Sometimes, progress (new power management or resource allocation logic) forgets about 15-year-old chipsets.

Key takeaways:

0xFFFF in logs is often a cry for help from the PCI bus, not hardware death.
Addresses matter. Moving a device above 4GB can bypass BIOS resource conflicts.
Don't rush to buy USB dongles. Sometimes Linux just needs you to "unplug and replug" the device via software.

FAQ

Why not a kernel bug report?
The hardware is 15 years old. A proper report would need a kernel bisect: roughly a dozen or more build-and-reboot cycles. On a 15-year-old laptop with 8GB of RAM, that's torture. I chose the efficient workaround: a 5-minute script vs. weeks of compilation for a fix that might never be merged.

Maybe the hardware is just dying?
The logs say otherwise. If it were dying, it wouldn't work stably after a rescan; it would fail randomly or not be detected at all. This points to a software/resource allocation issue, not failing hardware.