His AI Said 'Swap the PSU.' He Said 'One More Test.'

Luis Enrique Otero Jiménez — Thu, 04 Jun 2026 07:08:11 +0000

How a homelab engineer and his AI pair-debugger cornered an RTX 3090 that hard-reset the entire machine the instant it ran inference — and why neither of them could have solved it alone.

A crash with no body

The first thing Marco noticed was the silence.

Not an error. Not a kernel panic scrolling up the screen. Not even a flicker in the logs. Just — the machine, gone. One moment his homelab box was answering an embeddings request for the little self-hosted knowledge base he'd been building; the next, the fans spun down, the screen went black, and the box rebooted as if someone had yanked the cord.

He did what any engineer does. He went to read the logs.

There were none.

journalctl showed a clean boot, then nothing, then the next clean boot. No Xid. No NVRM. No call trace. The kernel ring buffer had no last words. Whatever killed the machine had killed it so completely that the CPU never got to write a single line to disk.

A crash with no body. And it happened every single time he asked the GPU to think.

Fig. II — The machine that left no record: a hard reset that wrote nothing, anywhere.

The machine, and why it mattered

The box was nothing exotic: a single RTX 3090, 24 GB of VRAM, Ubuntu, and a stack of local models served through llama.cpp. Marco ran everything on-prem on purpose — a 26B chat model and a 4B embedding model, feeding a personal notes-and-search system he was rebuilding from scratch. The whole point was that nothing left the box.

Which meant the embeddings service was the project. If the GPU died every time it embedded a sentence, the project was dead too.

And there was a second, quieter problem. For a long time the machine had been perfectly stable on the 590-series NVIDIA driver. Then a routine system update pulled the kernel forward, and the driver came with it — up to the 595 series. The crashes started after that.

The obvious move was to roll back to 590. Marco tried. It wasn't a driver swap; it was a trapdoor. The 590 module was only ever built for a kernel two versions behind where the update had landed, and the distro had already retired the 590 branch entirely. "Going back to 590" really meant pinning the kernel at an old release forever — no security updates, a frozen island he could never sail off of.

He didn't want an island. He wanted his machine. So the real task wasn't "revert." It was: make 595 work.

The ghost of a problem already solved

Here's the part that made everything harder: Marco had killed a crash on this exact machine before.

Months earlier, the same box had a different death — it would shut completely off under sustained inference. That one turned out to be brutally physical. An RTX 3090 doesn't draw power smoothly; it throws microsecond transient spikes that can momentarily hit nearly twice its rated draw. The original power supply's over-current protection saw those spikes, decided something was wrong, and cut the rails. A beefier PSU fixed it for good. Afterward the box happily pulled 350 W sustained without so much as a hiccup.

That fix was real. The PSU genuinely was the culprit, and swapping it genuinely solved it.

But it left a fingerprint on how Marco — and later his AI — would think. "Box dies under GPU load" now had an obvious prior: it's power again. Usually a good instinct. This time, a trap with the safety off.

Enter the co-debugger

Marco had been pair-debugging with an AI agent — the kind that can read his shell, write code, edit configs, and reason about systems out loud. He'd come to treat it less like a chatbot and more like a tireless junior engineer with an encyclopedic memory and zero ego about grunt work.

The first thing it did was solve the "crash with no body" problem — by refusing to rely on the body at all.

You cannot read the logs of a machine that reboots before it can flush them. So the agent set up out-of-band capture: it streamed the kernel log over the network to a second machine via netconsole — anywhere the dying box's last words might land instead of dying with it — and added a one-hertz telemetry trail (power, clocks, P-state, VRAM, PCIe), flushed to disk every line so the final sample before a reset couldn't be lost. It even verified the whole pipe end-to-end: test lines and a live kernel stream arrived intact at the second machine.

# On the box that crashes: stream kernel messages off-machine BEFORE the crash.
# netconsole=<srcport>@<src-ip>/<dev>,<dstport>@<listener-ip>/<gateway-or-listener-mac>
sudo modprobe netconsole \
  netconsole=6666@192.0.2.10/eth0,6666@192.0.2.20/aa:bb:cc:dd:ee:ff

# On the listener machine: catch it.
nc -u -l -p 6666 | tee netconsole-capture.log

# A telemetry trail that survives a hard reset: flush every sample to disk.
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,pstate,memory.used,pcie.link.gen.current \
           --format=csv,noheader -l 1 \
| while IFS= read -r line; do
    printf '%s\n' "$line" >> /var/log/gpu-crash-trail.log
    sync   # force it to disk NOW — the box may not get another chance
  done

Then the agent did the second indispensable thing: it turned "random" into a deterministic trigger. With careful repetition it nailed down that the crash wasn't intermittent at all. It fired reliably on a real inference request — even a single one, even from the small 4 GB embedding model alone.

# The whole repro. Fire one real inference request; watch the box die.
curl -s http://127.0.0.1:8082/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"trigger the crash"}'

And here's the twist Marco didn't expect: the capture rig never caught the crash. Not once. Every time they pulled the trigger, the box died — and the netconsole stream, the one they'd just proven worked, fell dead silent at the instant of death. The on-disk trail caught only the calm before: the GPU idling cool at 41 °C, a modest 116 W, an unremarkable P-state — and then a clean reboot, no last line.

That silence was the first real clue. A capture you've verified works, catching absolutely nothing at the moment of failure, isn't a failed experiment — it's a result. It meant the machine was dying faster than the CPU could write a single character: no panic, no Xid, no driver complaint, because the kernel never got the chance. This wasn't software crashing the system. The system was being switched off from below the software — a hardware-level reset. The absence of evidence was the evidence, and it quietly demolished an entire class of theories before they began.

A reproducible crash isn't a solved crash. But it's the difference between hunting a ghost and running an experiment. For the first time, Marco felt like they were doing science instead of lighting candles.

Fig. III — A capture rig, verified working, recording the silence of a hardware-level reset.

The gauntlet

What followed was two nights of the agent methodically walking up to every hypothesis and shooting it in the head. Marco would propose; the agent would build the test, run it against the deterministic trigger, and read the result off the out-of-band trail.

One by one, they fell:

Hypothesis	Test	Result
GSP firmware bug (Ampere classic)	Disable it (`NVreg_EnableGpuFirmware=0`)	❌ crashed anyway
BAR1 / VA-space exhaustion (`open-gpu-kernel-modules` #1134)	Would emit `Xid 31/154`	❌ none ever captured
Thermal	Read junction temp at crash	❌ died at 41 °C
High core power / compute	cuBLAS burn at ~284 W	❌ stable 6+ minutes
Deep-idle cold-wake	Keepalive pinned at P5	❌ crashed at steady P5
VRAM pressure	Oversubscribe 24 GB+	❌ survived 30+ min
Power magnitude	Cap to 100 W (firmware floor)	❌ crashed anyway
Lost BIOS settings (the CMOS battery had been replaced)	Re-apply the entire power-management lever set, confirmed live	❌ crashed anyway

Fig. IV — A catalogue of falsified causes: every tunable lever, ruled out one by one.

That last one stung. The migration to the new case had reset the BIOS, and "a lost BIOS setting" was a beautiful theory — it would have explained the whole "stable, then not" arc. The agent restored every relevant lever — PCIe link speed pinned to Gen3, Resizable BAR off, ASPM off, clock gating off, C-states clamped — and confirmed each one was actually in effect. The box still hard-reset on a single embed request.

Meanwhile, every read-only health check came back pristine: zero PCIe replays, zero AER errors, a full Gen3 x16 link under load, no pending channel repairs, a healthy VBIOS. The card looked perfect. It just kept committing suicide whenever it thought.

"It's hardware. Swap the PSU."

Here's where the story turns, and where it gets honest.

After the gauntlet, the agent reached a conclusion — and it stated it clearly, more than once, with a genuinely solid argument behind it:

Software is exhausted. Every tunable lever has been falsified. This is a hardware fault. The next steps are physical: reseat the GPU power cables with no daisy-chaining, then swap or test the PSU, then cross-test the card in another machine.

And you can see why. They'd ruled out firmware, thermals, VRAM, BIOS, and raw power level. The crash left no software trace. The machine had a real power-delivery fault in its past. Pattern-match complete: it's power again, go physical.

It was the reasonable conclusion. It was sound given the evidence.

It was also wrong.

This is the moment that decides these investigations. The tooling had done everything right and pointed confidently at the door marked Buy New Hardware. Marco's hand was on the screwdriver. Reseating cables, swapping a known-good PSU, eventually RMA'ing a card — days of work and real money, on the word of a very convincing diagnosis.

He stopped. Something nagged. The cuBLAS burn had pulled 284 watts of pure compute for six minutes and never flinched — but a tiny embedding request, drawing a fraction of that, killed the box instantly. If this were a gross power-delivery fault, the brutal sustained burn should have been more dangerous, not less. The magnitude story didn't fit.

So instead of asking "how do I fix the hardware," he asked a different question: "are we even sure this is about inference being heavy? What if it's about inference being weird?"

Not "fix it." Characterize it.

One more test

The agent took the new framing and ran with it — and this is the other half of the story, the half where the human alone is helpless. Marco could have the instinct that "heavy vs. weird" mattered. He could not, on a side project at 1 a.m., have hand-written a suite of raw-CUDA reproducers to prove it. The agent could, and did, in minutes.

The idea: strip away llama.cpp entirely and probe the GPU with pure, hand-shaped CUDA workloads, each isolating one flavor of load.

// burn.cu — sustained, SMOOTH cuBLAS SGEMM. ~280W of clean compute.
// Build: nvcc -O3 burn.cu -lcublas -o burn
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);

    cublasHandle_t h; cublasCreate(&h);
    const float alpha = 1.f, beta = 0.f;

    for (long i = 0; ; ++i) {            // run until it crashes — or you give up waiting
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();
    }
}

They built variations: sustained bursty SGEMM with 90-watt swings; a cold-burst version that slammed from warm-idle to full power with razor-sharp di/dt edges; a PCIe stress test that saturated the bus at 12.7 GB/s. Each one targeted a specific bogeyman — power swings, current slew rate, bus traffic.

Every one of them survived.

Then the decisive pair. The agent ran the real embedding model two ways. The difference was a single flag — --n-gpu-layers, how much of the model lives on the GPU versus shuttling between CPU and GPU per layer:

Workload	Peak power	Result
Bursty SGEMM (smooth compute)	277 W	✅ survives
Cold-burst, sharp di/dt	122 → 278 W	✅ survives
PCIe saturation	12.7 GB/s	✅ survives
Embedding, full offload (`-ngl 99`)	286 W	✅ survives, 60 requests, rock-solid
Embedding, partial offload (`-ngl 20`)	low	❌ instant reset, every time

There it was. Same card. Same model. Higher power on the stable config. The only thing that reliably killed the machine was partial offload — and it died on a single request while drawing far less than the burn that ran for six minutes.

Fig. V — The shape of the load, not its size: full offload runs smooth; partial offload stutters and kills the box.

What the crash actually was

The signature finally made sense.

Partial offload doesn't produce a heavy, smooth load. It produces a stutter: the GPU spins up to compute one small per-layer operation, then stalls waiting on a PCIe round-trip to fetch the next layer from system RAM, then spins up again — a high-frequency sawtooth of micro-bursts synced to bus latency. Full offload keeps the whole model resident and runs a smooth, continuous stream. Pure cuBLAS is smoother still.

So the killer wasn't power magnitude (the 100 W cap crashed; 286 W didn't). It wasn't generic di/dt (the cold-burst test was sharper and survived). It wasn't PCIe bandwidth (saturation survived). It wasn't compute. It was the specific waveform of partial-offload inference — fine compute micro-bursts interleaved with PCIe-synced idle, a pattern nothing else they'd thrown at the card reproduced.

Most likely: either a marginal, almost resonant power-delivery transient that only that particular stutter excites — or a power-management bug in the 595 driver around rapid P-state transitions on Ampere. Possibly both, feeding each other. The "swap the PSU" verdict wasn't crazy; it was just aimed at the wrong layer. The PSU could deliver the energy fine. It was the shape of the demand that nobody's hardware or firmware liked.

The fix that let him keep his machine

The workaround fell straight out of the diagnosis: never run the model in partial offload. Put the whole embedding model on the GPU and don't co-load the big chat model that had been forcing the cramped VRAM budget in the first place.

- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial → the crash waveform
+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → smooth, stable

The validation wasn't shy. The agent fired sustained, concurrent embedding load at the server: 2,272 requests, zero failures, peak 312 watts, not a single crash. Higher power than the workload that used to nuke the box on request one — and it didn't even blink. Marco enabled the service permanently.

No PSU swap. No RMA. No frozen-island kernel downgrade. The machine stayed on the maintained 595 driver, and the project that depended on it came back to life.

Fig. VI — 2,272 requests at 312 W, without fault.

The honest part

The root cause is still not nailed. The workaround sidesteps the failure mode; it doesn't explain it down to the transistor. The clean next experiment — boot the known-good 590 driver and re-run the partial-offload trigger — would finally separate "595 driver bug" from "marginal hardware transient." Marco left it for another night. A characterized, validated workaround that lets him keep shipping beats a perfect post-mortem he doesn't have time to write.

That's allowed. Engineering is not forensics. Sometimes "I know exactly how to avoid it and I've proven the avoidance holds 2,272 times" is the right place to stop.

Why it took both of them

It's tempting to tell this as "the AI cracked a hard bug." It's also tempting to tell it as "the AI was wrong and the human saved the day." Both are too neat. The truth is more interesting, and more useful.

Without the agent, this was unsolvable on a side project. The out-of-band capture rig, the deterministic trigger, the entire falsification gauntlet, and — above all — a suite of bespoke raw-CUDA reproducers conjured on demand at midnight: that is days of specialist work, compressed into hours, executed without fatigue or shortcuts. No solo hobbyist grinds through all of that. Most would have swapped parts on a guess and either gotten lucky or given up.

And without the human, the agent would have bought a power supply it didn't need. Its "it's hardware, go physical" verdict was well-reasoned and confidently delivered — and premature. What broke the case wasn't more analysis. It was a human refusing the confident conclusion, sitting with a detail that didn't fit (six minutes at 284 W, but dead on a 4 W embed?), and changing the question from fix it to characterize it.

The agent was the instrument: precise, tireless, encyclopedic. The human was the one who held the line when the instrument pointed at the wrong door. Neither half solves this. The pairing does.

Fig. VII — Neither alone: the instrument and the judge.

Takeaways (for humans and agents alike)

A hard reset with zero kernel log is a hardware-level reset — the CPU never reached a printk. Don't waste days grepping dmesg for a fault that never gets to speak.
Capture out-of-band, first — and remember that silence is data. A box that reboots destroys its own evidence, so stream the kernel log to another machine and flush telemetry every line. But if a capture you've verified works still catches nothing at the moment of the crash, that isn't a failed setup — it's proof the fault lives below the software, where no printk can reach.
Get a deterministic trigger before you theorize. "Intermittent" is a hunt; "fires on this exact request" is an experiment.
Falsify one variable at a time. The multi-variable migration (driver + kernel + cables + BIOS, all at once) is exactly what makes "X is stable, Y crashes" correlations lie to you.
Separate the waveform from the magnitude. The single most decisive move here was proving that smooth 286 W was fine while a stuttery few-watt load was fatal. Shape, not size.
A past fix is a prior, not a verdict. The PSU really was the culprit once. That history made "it's power again" feel obvious — and obvious was wrong.
Treat any single confident diagnosis — yours or your AI's — as a hypothesis, not a conclusion. The best tooling in the world can be soundly, persuasively early.

TL;DR — purely technical

Setup

GPU: NVIDIA RTX 3090, 24 GB
OS / kernel: Ubuntu 24.04, HWE kernel 6.17.0-35
Driver: NVIDIA 595.71.05 (proprietary). Last known-good: 590 — but it's only built for kernel 6.17.0-22 and the branch is retired, so reverting means pinning an old kernel permanently (no security updates).
Inference: llama.cpp (llama-server)
Models: Qwen3-Embedding-4B (Q8_0, ~4 GB) on its own server, co-loaded with a 26B chat model. The two together left no headroom in 24 GB → the embed model ran at --n-gpu-layers 20 (partial offload).
Board / PSU: ASUS Z390, NZXT C1200. (A prior Corsair RM850x genuinely tripped OCP on 3090 transients — a separate, earlier fault already fixed by that PSU swap; not this bug.)

Symptom

Instant, full-system hard reset (like a power cut) the moment a real inference request runs — even a single request to the 4 GB embed model alone.
Zero kernel log: no Xid, NVRM, panic, MCE, or PCIe AER. A verified-working netconsole + fsync'd on-disk telemetry caught nothing at the crash ⇒ hardware-level reset; the CPU never reaches a printk.
Steady state right before death: ~41 °C, P2, ~116 W. Deterministic, not intermittent.

Diagnostic — falsified, each with direct evidence

GSP firmware off (NVreg_EnableGpuFirmware=0) → still crashed.
BAR1 / VA-exhaustion (open-gpu-kernel-modules #1134) → would emit Xid 31/154; none ever captured. pci=realloc=off couldn't shrink BAR1 (kernel forces 32 GB at enumeration).
Thermal → crashed at 41–42 °C.
Power magnitude → 100 W cap (firmware floor) still crashed; yet a cuBLAS SGEMM burn at 284 W ran 6+ min stable.
VRAM → 24 GB+ oversubscription survived 30+ min.
BIOS (CMOS-reset suspicion) → full lever set applied & confirmed live (PEG link → Gen3, ReBAR off → BAR1 256 MiB, ASPM off, PCIe clock gating off, C-states clamped) → still crashed.
Card health (read-only): 0 PCIe replays, 0 AER, full Gen3 x16 under load, no Xid, Channel Repair Pending: No.
Isolation (raw CUDA, pure cuBLAS — no llama/ggml): bursty SGEMM (277 W), cold-burst di/dt (122→278 W), PCIe saturation (12.7 GB/s) → all survived. llama embed full offload -ngl 99 (286 W, 60 req) → survived. llama embed partial offload -ngl 20 → instant reset on one request.

Conclusion

Trigger is the partial-offload inference waveform — fine per-layer compute micro-bursts interleaved with PCIe-synced stalls — not power magnitude, di/dt, PCIe bandwidth, or compute. Root cause not definitively isolated: a marginal power-delivery transient vs. a 595-series power-management bug on rapid Ampere P-state transitions. Re-testing on driver 590 would discriminate (not yet done).

Workaround

- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial offload → crashes
+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → stable

Keep the embed model fully GPU-resident; don't co-load the 26B chat (it was what forced partial offload on 24 GB).
Validated: 2,272 concurrent requests, peak 312 W, zero crashes; enabled permanently as a systemd user service.
Stays on the maintained 595 driver — no PSU swap, no RMA, no kernel pin.

Originally published on dev.to.

DEV Community: Luis Enrique Otero Jiménez