<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luis Enrique Otero Jiménez</title>
    <description>The latest articles on DEV Community by Luis Enrique Otero Jiménez (@lenriqueotero).</description>
    <link>https://dev.to/lenriqueotero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3967671%2Fa93ec7a3-307c-4ca4-9aa9-998e0576544f.png</url>
      <title>DEV Community: Luis Enrique Otero Jiménez</title>
      <link>https://dev.to/lenriqueotero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lenriqueotero"/>
    <language>en</language>
    <item>
      <title>His AI Said 'Swap the PSU.' He Said 'One More Test.'</title>
      <dc:creator>Luis Enrique Otero Jiménez</dc:creator>
      <pubDate>Thu, 04 Jun 2026 07:08:11 +0000</pubDate>
      <link>https://dev.to/lenriqueotero/his-ai-said-swap-the-psu-he-said-one-more-test-2i7g</link>
      <guid>https://dev.to/lenriqueotero/his-ai-said-swap-the-psu-he-said-one-more-test-2i7g</guid>
      <description>&lt;p&gt;&lt;em&gt;How a homelab engineer and his AI pair-debugger cornered an RTX 3090 that hard-reset the entire machine the instant it ran inference — and why neither of them could have solved it alone.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A crash with no body
&lt;/h2&gt;

&lt;p&gt;The first thing Marco noticed was the silence.&lt;/p&gt;

&lt;p&gt;Not an error. Not a kernel panic scrolling up the screen. Not even a flicker in the logs. Just — the machine, gone. One moment his homelab box was answering an embeddings request for the little self-hosted knowledge base he'd been building; the next, the fans spun down, the screen went black, and the box rebooted as if someone had yanked the cord.&lt;/p&gt;

&lt;p&gt;He did what any engineer does. He went to read the logs.&lt;/p&gt;

&lt;p&gt;There were none.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;journalctl&lt;/code&gt; showed a clean boot, then nothing, then the next clean boot. No &lt;code&gt;Xid&lt;/code&gt;. No &lt;code&gt;NVRM&lt;/code&gt;. No call trace. The kernel ring buffer had no last words. Whatever killed the machine had killed it so completely that the CPU never got to write a single line to disk.&lt;/p&gt;

&lt;p&gt;A crash with no body. And it happened &lt;em&gt;every single time&lt;/em&gt; he asked the GPU to think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvoe22noeqvw714br4cyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvoe22noeqvw714br4cyz.png" alt="A dead electronic " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. II — The machine that left no record: a hard reset that wrote nothing, anywhere.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The machine, and why it mattered
&lt;/h2&gt;

&lt;p&gt;The box was nothing exotic: a single RTX 3090, 24 GB of VRAM, Ubuntu, and a stack of local models served through &lt;code&gt;llama.cpp&lt;/code&gt;. Marco ran everything on-prem on purpose — a 26B chat model and a 4B embedding model, feeding a personal notes-and-search system he was rebuilding from scratch. The whole point was that nothing left the box.&lt;/p&gt;

&lt;p&gt;Which meant the embeddings service &lt;em&gt;was&lt;/em&gt; the project. If the GPU died every time it embedded a sentence, the project was dead too.&lt;/p&gt;

&lt;p&gt;And there was a second, quieter problem. For a long time the machine had been perfectly stable on the &lt;strong&gt;590-series&lt;/strong&gt; NVIDIA driver. Then a routine system update pulled the kernel forward, and the driver came with it — up to the &lt;strong&gt;595 series&lt;/strong&gt;. The crashes started after that.&lt;/p&gt;

&lt;p&gt;The obvious move was to roll back to 590. Marco tried. It wasn't a driver swap; it was a trapdoor. The 590 module had only ever been built for a kernel two versions behind where the update had landed, and the distro had already retired the 590 branch entirely. "Going back to 590" really meant &lt;em&gt;pinning the kernel at an old release forever&lt;/em&gt;: no security updates, a frozen island he could never sail away from.&lt;/p&gt;

&lt;p&gt;He didn't want an island. He wanted his machine. So the real task wasn't "revert." It was: &lt;strong&gt;make 595 work.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The ghost of a problem already solved
&lt;/h2&gt;

&lt;p&gt;Here's the part that made everything harder: Marco had killed a crash on this exact machine before.&lt;/p&gt;

&lt;p&gt;Months earlier, the same box had a different death — it would shut &lt;em&gt;completely off&lt;/em&gt; under sustained inference. That one turned out to be brutally physical. An RTX 3090 doesn't draw power smoothly; it throws microsecond transient spikes that can momentarily hit nearly twice its rated draw. The original power supply's over-current protection saw those spikes, decided something was wrong, and cut the rails. A beefier PSU fixed it for good. Afterward the box happily pulled 350 W sustained without so much as a hiccup.&lt;/p&gt;

&lt;p&gt;That fix was &lt;strong&gt;real&lt;/strong&gt;. The PSU genuinely was the culprit, and swapping it genuinely solved it.&lt;/p&gt;

&lt;p&gt;But it left a fingerprint on how Marco — and later his AI — would think. "Box dies under GPU load" now had an obvious prior: &lt;em&gt;it's power again.&lt;/em&gt; Usually a good instinct. This time, a trap with the safety off.&lt;/p&gt;
&lt;h2&gt;
  
  
  Enter the co-debugger
&lt;/h2&gt;

&lt;p&gt;Marco had been pair-debugging with an AI agent — the kind that can read his shell, write code, edit configs, and reason about systems out loud. He'd come to treat it less like a chatbot and more like a tireless junior engineer with an encyclopedic memory and zero ego about grunt work.&lt;/p&gt;

&lt;p&gt;The first thing it did was solve the "crash with no body" problem — by refusing to rely on the body at all.&lt;/p&gt;

&lt;p&gt;You cannot read the logs of a machine that reboots before it can flush them. So the agent set up &lt;strong&gt;out-of-band capture&lt;/strong&gt;. It streamed the kernel log over the network to a &lt;em&gt;second&lt;/em&gt; machine via &lt;code&gt;netconsole&lt;/code&gt;, so the dying box's last words would land somewhere that didn't die with it. It also added a one-hertz telemetry trail (power, clocks, P-state, VRAM, PCIe link generation), flushed to disk after &lt;em&gt;every line&lt;/em&gt; so the final sample before a reset couldn't be lost. It even verified the whole pipe end-to-end: test lines and a live kernel stream arrived intact at the second machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the box that crashes: stream kernel messages off-machine BEFORE the crash.&lt;/span&gt;
&lt;span class="c"&gt;# netconsole=&amp;lt;srcport&amp;gt;@&amp;lt;src-ip&amp;gt;/&amp;lt;dev&amp;gt;,&amp;lt;dstport&amp;gt;@&amp;lt;listener-ip&amp;gt;/&amp;lt;gateway-or-listener-mac&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;modprobe netconsole &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;netconsole&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;6666@192.0.2.10/eth0,6666@192.0.2.20/aa:bb:cc:dd:ee:ff

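&lt;span class="c"&gt;# On the listener machine: catch it.&lt;/span&gt;
&lt;span class="c"&gt;# (Flags vary by netcat build; some, e.g. OpenBSD nc, want: nc -u -l 6666)&lt;/span&gt;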
nc &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 6666 | &lt;span class="nb"&gt;tee &lt;/span&gt;netconsole-capture.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# A telemetry trail that survives a hard reset: flush every sample to disk.&lt;/span&gt;
nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;timestamp,power.draw,clocks.sm,pstate,memory.used,pcie.link.gen.current &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader &lt;span class="nt"&gt;-l&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
| &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\n'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/gpu-crash-trail.log
    &lt;span class="nb"&gt;sync&lt;/span&gt;   &lt;span class="c"&gt;# force it to disk NOW — the box may not get another chance&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
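
&lt;p&gt;After the reboot, the trail only helps if you read the right line. A minimal sketch, assuming the six-field CSV layout produced by the query above; &lt;code&gt;last_sample&lt;/code&gt; is a hypothetical helper name, not something taken from the original session. It prints the last complete sample and skips a final line that was cut short by the reset.&lt;/p&gt;

```shell
# Hypothetical helper: print the last complete sample from the trail file.
# Assumes nvidia-smi's CSV output, where fields are separated by ", " and
# the query above yields exactly six fields per line.
last_sample() {
  awk -F', ' 'NF == 6 { last = $0 } END { if (last != "") print last }' "$1"
}

# Example (path as used in the logging loop above):
#   last_sample /var/log/gpu-crash-trail.log
```

&lt;p&gt;Filtering on the field count is a design choice: a truncated tail is ignored rather than misread as the GPU's final state.&lt;/p&gt;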



&lt;p&gt;Then the agent did the second indispensable thing: it turned "random" into a &lt;strong&gt;deterministic trigger&lt;/strong&gt;. With careful repetition it nailed down that the crash wasn't intermittent at all. It fired &lt;em&gt;reliably&lt;/em&gt; on a real inference request — even a single one, even from the small 4 GB embedding model alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The whole repro. Fire one real inference request; watch the box die.&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:8082/v1/embeddings &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"input":"trigger the crash"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the twist Marco didn't expect: the capture rig never caught the crash. Not once. Every time they pulled the trigger, the box died — and the &lt;code&gt;netconsole&lt;/code&gt; stream, the one they'd just &lt;em&gt;proven&lt;/em&gt; worked, fell dead silent at the instant of death. The on-disk trail caught only the calm before: the GPU idling cool at 41 °C, a modest 116 W, an unremarkable P-state — and then a clean reboot, no last line.&lt;/p&gt;

&lt;p&gt;That silence was the first real clue. A capture you've verified works, catching &lt;em&gt;absolutely nothing&lt;/em&gt; at the moment of failure, isn't a failed experiment — it's a result. It meant the machine was dying faster than the CPU could write a single character: no panic, no &lt;code&gt;Xid&lt;/code&gt;, no driver complaint, because the kernel never got the chance. This wasn't software crashing the system. The system was being switched off from &lt;em&gt;below&lt;/em&gt; the software — a hardware-level reset. The absence of evidence was the evidence, and it quietly demolished an entire class of theories before they began.&lt;/p&gt;

&lt;p&gt;A reproducible crash isn't a solved crash. But it's the difference between hunting a ghost and running an experiment. For the first time, Marco felt like they were &lt;em&gt;doing science&lt;/em&gt; instead of lighting candles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p2iyexvwx2l3rmcqwaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p2iyexvwx2l3rmcqwaj.png" alt="Two engraved apparatus joined by a wire; the recording drum shows only a flat, silent trace" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. III — A capture rig, verified working, recording the silence of a hardware-level reset.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The gauntlet
&lt;/h2&gt;

&lt;p&gt;What followed was two nights of the agent methodically walking up to every hypothesis and shooting it in the head. Marco would propose; the agent would build the test, run it against the deterministic trigger, and read the result off the out-of-band trail.&lt;/p&gt;

&lt;p&gt;One by one, they fell:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GSP firmware bug (Ampere classic)&lt;/td&gt;
&lt;td&gt;Disable it (&lt;code&gt;NVreg_EnableGpuFirmware=0&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;❌ crashed anyway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BAR1 / VA-space exhaustion (&lt;code&gt;open-gpu-kernel-modules&lt;/code&gt; #1134)&lt;/td&gt;
&lt;td&gt;Would emit &lt;code&gt;Xid 31/154&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;❌ none ever captured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thermal&lt;/td&gt;
&lt;td&gt;Read junction temp at crash&lt;/td&gt;
&lt;td&gt;❌ died at &lt;strong&gt;41 °C&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High core power / compute&lt;/td&gt;
&lt;td&gt;cuBLAS burn at ~284 W&lt;/td&gt;
&lt;td&gt;❌ stable 6+ minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep-idle cold-wake&lt;/td&gt;
&lt;td&gt;Keepalive pinned at P5&lt;/td&gt;
&lt;td&gt;❌ crashed at steady P5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM pressure&lt;/td&gt;
&lt;td&gt;Oversubscribe 24 GB+&lt;/td&gt;
&lt;td&gt;❌ survived 30+ min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power &lt;strong&gt;magnitude&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Cap to 100 W (firmware floor)&lt;/td&gt;
&lt;td&gt;❌ crashed anyway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lost BIOS settings (the CMOS battery had been replaced)&lt;/td&gt;
&lt;td&gt;Re-apply the &lt;em&gt;entire&lt;/em&gt; power-management lever set, confirmed live&lt;/td&gt;
&lt;td&gt;❌ crashed anyway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
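
&lt;p&gt;The driver-side rows only count as falsified if the setting was really live when the box died. A minimal sketch of one way to check a module parameter, assuming the NVIDIA driver exposes its parameters as &lt;code&gt;Name: value&lt;/code&gt; lines in &lt;code&gt;/proc/driver/nvidia/params&lt;/code&gt;. The helper &lt;code&gt;nv_param&lt;/code&gt; and the modprobe.d file name are illustrative assumptions, not details from the original session.&lt;/p&gt;

```shell
# Hypothetical helper: read "Name: value" lines on stdin and print the value
# of the one parameter named in $1.
nv_param() {
  awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Setting the GSP option (file name is illustrative; it takes effect on the
# next driver load, so on Ubuntu rebuild the initramfs and reboot):
#   echo 'options nvidia NVreg_EnableGpuFirmware=0' | sudo tee /etc/modprobe.d/nvidia-gsp.conf
#   sudo update-initramfs -u
# Checking it after the reboot:
#   cat /proc/driver/nvidia/params | nv_param EnableGpuFirmware
#
# The power-cap row, set and read back:
#   sudo nvidia-smi -pl 100
#   nvidia-smi --query-gpu=power.limit --format=csv,noheader
```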

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ydi3t5ft5qv95ppnflp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ydi3t5ft5qv95ppnflp.png" alt="Six framed specimen vignettes, each struck through with an X — heat, VRAM, power, firmware, BIOS, cables" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. IV — A catalogue of falsified causes: every tunable lever, ruled out one by one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That last one stung. The migration to the new case had reset the BIOS, and "a lost BIOS setting" was a beautiful theory — it would have explained the whole "stable, then not" arc. The agent restored every relevant lever — PCIe link speed pinned to Gen3, Resizable BAR off, ASPM off, clock gating off, C-states clamped — and confirmed each one was actually in effect. The box still hard-reset on a single embed request.&lt;/p&gt;

&lt;p&gt;Meanwhile, every read-only health check came back &lt;em&gt;pristine&lt;/em&gt;: zero PCIe replays, zero AER errors, a full Gen3 x16 link under load, no pending channel repairs, a healthy VBIOS. The card looked perfect. It just kept committing suicide whenever it thought.&lt;/p&gt;
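
&lt;p&gt;Checks like these are easy to script. A minimal sketch for the AER part, assuming a kernel that exposes per-device AER counters in sysfs and a GPU at PCI address &lt;code&gt;0000:01:00.0&lt;/code&gt; (an assumption; &lt;code&gt;lspci&lt;/code&gt; shows the real one). &lt;code&gt;aer_total&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Hypothetical helper: the kernel's aer_dev_* files list one counter per line
# and end with a TOTAL_ERR_* line; print that total from stdin.
aer_total() {
  awk '$1 ~ /^TOTAL_ERR/ { print $2 }'
}

# Example, with the GPU's PCI address assumed to be 0000:01:00.0:
#   for kind in correctable nonfatal fatal; do
#     cat "/sys/bus/pci/devices/0000:01:00.0/aer_dev_$kind" | aer_total
#   done
# Link generation and width as the driver currently reports them:
#   nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv,noheader
```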
&lt;h2&gt;
  
  
  "It's hardware. Swap the PSU."
&lt;/h2&gt;

&lt;p&gt;Here's where the story turns, and where it gets honest.&lt;/p&gt;

&lt;p&gt;After the gauntlet, the agent reached a conclusion — and it stated it clearly, more than once, with a genuinely solid argument behind it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Software is exhausted. Every tunable lever has been falsified. This is a hardware fault. The next steps are physical: reseat the GPU power cables with no daisy-chaining, then swap or test the PSU, then cross-test the card in another machine.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And you can see why. They'd ruled out firmware, thermals, VRAM, BIOS, and raw power level. The crash left no software trace. The machine &lt;em&gt;had&lt;/em&gt; a real power-delivery fault in its past. Pattern-match complete: &lt;strong&gt;it's power again, go physical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was the reasonable conclusion. It was &lt;em&gt;sound given the evidence.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It was also wrong.&lt;/p&gt;

&lt;p&gt;This is the moment that decides these investigations. The tooling had done everything right and pointed confidently at the door marked &lt;em&gt;Buy New Hardware&lt;/em&gt;. Marco's hand was on the screwdriver. Reseating cables, swapping a known-good PSU, eventually RMA'ing a card — days of work and real money, on the word of a very convincing diagnosis.&lt;/p&gt;

&lt;p&gt;He stopped. Something nagged. The cuBLAS burn had pulled &lt;strong&gt;284 watts of pure compute for six minutes and never flinched&lt;/strong&gt; — but a tiny embedding request, drawing a fraction of that, killed the box instantly. If this were a gross power-delivery fault, the brutal sustained burn should have been &lt;em&gt;more&lt;/em&gt; dangerous, not less. The magnitude story didn't fit.&lt;/p&gt;

&lt;p&gt;So instead of asking &lt;em&gt;"how do I fix the hardware,"&lt;/em&gt; he asked a different question: &lt;strong&gt;"are we even sure this is about inference being heavy? What if it's about inference being &lt;em&gt;weird&lt;/em&gt;?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "fix it." &lt;strong&gt;Characterize it.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  One more test
&lt;/h2&gt;

&lt;p&gt;The agent took the new framing and ran with it — and this is the other half of the story, the half where the human alone is helpless. Marco could have the &lt;em&gt;instinct&lt;/em&gt; that "heavy vs. weird" mattered. He could not, on a side project at 1 a.m., have hand-written a suite of raw-CUDA reproducers to prove it. The agent could, and did, in minutes.&lt;/p&gt;

&lt;p&gt;The idea: strip away &lt;code&gt;llama.cpp&lt;/code&gt; entirely and probe the GPU with pure, hand-shaped CUDA workloads, each isolating one &lt;em&gt;flavor&lt;/em&gt; of load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// burn.cu — sustained, SMOOTH cuBLAS SGEMM. ~280W of clean compute.&lt;/span&gt;
&lt;span class="c1"&gt;// Build: nvcc -O3 burn.cu -lcublas -o burn&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;cublas_v2.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;cuda_runtime.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;cudaMalloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;cublasHandle_t&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cublasCreate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;            &lt;span class="c1"&gt;// run until it crashes — or you give up waiting&lt;/span&gt;
        &lt;span class="n"&gt;cublasSgemm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUBLAS_OP_N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUBLAS_OP_N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;cudaDeviceSynchronize&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They built variations: sustained bursty SGEMM with 90-watt swings; a &lt;em&gt;cold-burst&lt;/em&gt; version that slammed from warm-idle to full power with razor-sharp &lt;code&gt;di/dt&lt;/code&gt; edges; a PCIe stress test that saturated the bus at 12.7 GB/s. Each one targeted a specific bogeyman — power swings, current slew rate, bus traffic.&lt;/p&gt;

&lt;p&gt;Every one of them &lt;strong&gt;survived&lt;/strong&gt;.&lt;/p&gt;

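&lt;p&gt;Then the decisive pair. The agent ran the &lt;em&gt;real&lt;/em&gt; embedding model two ways. The difference was a single flag: &lt;code&gt;--n-gpu-layers&lt;/code&gt; (&lt;code&gt;-ngl&lt;/code&gt; for short), which sets how many of the model's layers are offloaded to the GPU. Set it high enough and the whole model lives in VRAM; set it lower and the work shuttles between CPU and GPU layer by layer:&lt;/p&gt;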

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Peak power&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bursty SGEMM (smooth compute)&lt;/td&gt;
&lt;td&gt;277 W&lt;/td&gt;
&lt;td&gt;✅ survives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold-burst, sharp di/dt&lt;/td&gt;
&lt;td&gt;122 → 278 W&lt;/td&gt;
&lt;td&gt;✅ survives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PCIe saturation&lt;/td&gt;
&lt;td&gt;12.7 GB/s&lt;/td&gt;
&lt;td&gt;✅ survives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding, &lt;strong&gt;full offload&lt;/strong&gt; (&lt;code&gt;-ngl 99&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;286 W&lt;/td&gt;
&lt;td&gt;✅ survives, 60 requests, rock-solid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding, &lt;strong&gt;partial offload&lt;/strong&gt; (&lt;code&gt;-ngl 20&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;td&gt;❌ &lt;strong&gt;instant reset, every time&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There it was. Same card. Same model. &lt;em&gt;Higher&lt;/em&gt; power on the stable config. The only thing that reliably killed the machine was &lt;strong&gt;partial offload&lt;/strong&gt; — and it died on a single request while drawing far less than the burn that ran for six minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuljuxk9a1l129v3xm7if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuljuxk9a1l129v3xm7if.png" alt="Two engraved oscilloscope traces: a smooth wave labeled full offload / stable, and a jagged stutter labeled partial offload / fault" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. V — The shape of the load, not its size: full offload runs smooth; partial offload stutters and kills the box.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What the crash actually was
&lt;/h2&gt;

&lt;p&gt;The signature finally made sense.&lt;/p&gt;

&lt;p&gt;Partial offload doesn't produce a heavy, smooth load. It produces a &lt;em&gt;stutter&lt;/em&gt;: the GPU spins up to compute one small per-layer operation, then stalls waiting on a PCIe round-trip to fetch the next layer from system RAM, then spins up again — a high-frequency sawtooth of micro-bursts synced to bus latency. Full offload keeps the whole model resident and runs a smooth, continuous stream. Pure cuBLAS is smoother still.&lt;/p&gt;

&lt;p&gt;So the killer wasn't power &lt;em&gt;magnitude&lt;/em&gt; (the 100 W cap crashed; 286 W didn't). It wasn't generic &lt;code&gt;di/dt&lt;/code&gt; (the cold-burst test was sharper and survived). It wasn't PCIe bandwidth (saturation survived). It wasn't compute. It was the specific &lt;strong&gt;waveform&lt;/strong&gt; of partial-offload inference — fine compute micro-bursts interleaved with PCIe-synced idle, a pattern nothing else they'd thrown at the card reproduced.&lt;/p&gt;

&lt;p&gt;Most likely: either a marginal, almost &lt;em&gt;resonant&lt;/em&gt; power-delivery transient that only that particular stutter excites — or a power-management bug in the 595 driver around rapid P-state transitions on Ampere. Possibly both, feeding each other. The "swap the PSU" verdict wasn't crazy; it was just aimed at the wrong layer. The PSU could deliver the &lt;em&gt;energy&lt;/em&gt; fine. It was the &lt;em&gt;shape&lt;/em&gt; of the demand that nobody's hardware or firmware liked.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix that let him keep his machine
&lt;/h2&gt;

&lt;p&gt;The workaround fell straight out of the diagnosis: &lt;strong&gt;never run the model in partial offload.&lt;/strong&gt; Put the whole embedding model on the GPU and don't co-load the big chat model that had been forcing the cramped VRAM budget in the first place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial → the crash waveform
&lt;/span&gt;&lt;span class="gi"&gt;+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → smooth, stable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The validation wasn't shy. The agent fired sustained, concurrent embedding load at the server: &lt;strong&gt;2,272 requests, zero failures, peak 312 watts, not a single crash.&lt;/strong&gt; Higher power than the workload that used to nuke the box on request one — and it didn't even blink. Marco enabled the service permanently.&lt;/p&gt;
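
&lt;p&gt;A load run of that kind can be sketched with &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;xargs&lt;/code&gt;. This is an illustration, not the agent's actual harness: the endpoint is the one from the repro above, while &lt;code&gt;embed_load&lt;/code&gt;, &lt;code&gt;count_failures&lt;/code&gt;, the request body, and the worker count are assumptions.&lt;/p&gt;

```shell
# Hypothetical helpers for a concurrent embedding load run.

# Count HTTP status lines on stdin that are not 200. curl reports 000 when
# the request never completed, so a dead server counts as a failure too.
count_failures() {
  awk '$1 != "200" { bad++ } END { print bad + 0 }'
}

# Fire $2 requests at URL $1 with up to $3 running in parallel; print one
# status code per request. xargs substitutes the request number for {}.
embed_load() {
  seq "$2" | xargs -P "$3" -I{} \
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H 'Content-Type: application/json' \
      -d '{"input":"load test {}"}' "$1"
}

# Example: 200 requests, 8 at a time, then the failure count.
#   embed_load http://127.0.0.1:8082/v1/embeddings 200 8 | count_failures
```

&lt;p&gt;Run it next to the telemetry trail, so a failure count of zero can be read against the peak power the card actually reached.&lt;/p&gt;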

&lt;p&gt;No PSU swap. No RMA. No frozen-island kernel downgrade. The machine stayed on the maintained 595 driver, and the project that depended on it came back to life.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m6ujmoceusd48lpduqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m6ujmoceusd48lpduqr.png" alt="An engraved gauge reading 312 W beside a counter at 2272, next to the stable " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. VI — 2,272 requests at 312 W, without fault.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;The root cause is still not &lt;em&gt;nailed&lt;/em&gt;. The workaround sidesteps the failure mode; it doesn't explain it down to the transistor. The clean next experiment — boot the known-good 590 driver and re-run the partial-offload trigger — would finally separate "595 driver bug" from "marginal hardware transient." Marco left it for another night. A characterized, validated workaround that lets him keep shipping beats a perfect post-mortem he doesn't have time to write.&lt;/p&gt;

&lt;p&gt;That's allowed. Engineering is not forensics. Sometimes "I know exactly how to avoid it and I've proven the avoidance holds 2,272 times" is the right place to stop.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why it took both of them
&lt;/h2&gt;

&lt;p&gt;It's tempting to tell this as "the AI cracked a hard bug." It's also tempting to tell it as "the AI was wrong and the human saved the day." Both are too neat. The truth is more interesting, and more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the agent, this was unsolvable on a side project.&lt;/strong&gt; The out-of-band capture rig, the deterministic trigger, the entire falsification gauntlet, and — above all — a suite of bespoke raw-CUDA reproducers conjured on demand at midnight: that is days of specialist work, compressed into hours, executed without fatigue or shortcuts. No solo hobbyist grinds through all of that. Most would have swapped parts on a guess and either gotten lucky or given up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And without the human, the agent would have bought a power supply it didn't need.&lt;/strong&gt; Its "it's hardware, go physical" verdict was well-reasoned and confidently delivered — and premature. What broke the case wasn't more analysis. It was a human refusing the confident conclusion, sitting with a detail that didn't fit (six minutes at 284 W, but dead on a 4 W embed?), and changing the &lt;em&gt;question&lt;/em&gt; from &lt;em&gt;fix it&lt;/em&gt; to &lt;em&gt;characterize it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The agent was the instrument: precise, tireless, encyclopedic. The human was the one who held the line when the instrument pointed at the wrong door. Neither half solves this. The pairing does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4spc84sx49kztdbsai8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4spc84sx49kztdbsai8.png" alt="A natural philosopher and a brass automaton examining the " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. VII — Neither alone: the instrument and the judge.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Takeaways (for humans and agents alike)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A hard reset with zero kernel log is a &lt;em&gt;hardware-level&lt;/em&gt; reset&lt;/strong&gt; — the CPU never reached a &lt;code&gt;printk&lt;/code&gt;. Don't waste days grepping &lt;code&gt;dmesg&lt;/code&gt; for a fault that never gets to speak.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture out-of-band first, and remember that silence is data.&lt;/strong&gt; A box that reboots destroys its own evidence, so stream the kernel log to another machine and flush telemetry to disk after every line. But if a capture you've &lt;em&gt;verified works&lt;/em&gt; still catches nothing at the moment of the crash, that isn't a failed setup; it's proof the fault lives below the software, where no &lt;code&gt;printk&lt;/code&gt; can reach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get a deterministic trigger before you theorize.&lt;/strong&gt; "Intermittent" is a hunt; "fires on this exact request" is an experiment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falsify one variable at a time.&lt;/strong&gt; The multi-variable migration (driver + kernel + cables + BIOS, all at once) is exactly what makes "X is stable, Y crashes" correlations lie to you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate the &lt;em&gt;waveform&lt;/em&gt; from the &lt;em&gt;magnitude&lt;/em&gt;.&lt;/strong&gt; The single most decisive move here was proving that smooth 286 W was fine while a stuttery few-watt load was fatal. Shape, not size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A past fix is a prior, not a verdict.&lt;/strong&gt; The PSU really was the culprit once. That history made "it's power again" feel obvious — and obvious was wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat any single confident diagnosis, whether yours or your AI's, as a hypothesis, not a conclusion.&lt;/strong&gt; Even the best tooling in the world can be well-reasoned, persuasive, and premature.&lt;/li&gt;
&lt;/ol&gt;
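
&lt;p&gt;The advice in takeaway 2 to flush telemetry on every line can be sketched in a few lines of Python. This is a minimal illustration, not the logger used in the investigation: the file name, the &lt;code&gt;log_sample&lt;/code&gt; helper, and the sample fields are all hypothetical. The idea is only that each record goes through &lt;code&gt;flush()&lt;/code&gt; and &lt;code&gt;os.fsync()&lt;/code&gt; before the next one is taken, so the last line written before a hard reset has the best chance of surviving on disk.&lt;/p&gt;

```python
import json
import os
import time


def log_sample(handle, sample):
    """Append one JSON record and force it to disk before returning."""
    record = dict(sample, ts=time.time())
    handle.write(json.dumps(record) + "\n")
    handle.flush()             # drain Python's userspace buffer
    os.fsync(handle.fileno())  # ask the kernel to commit the file data


if __name__ == "__main__":
    # Hypothetical readings; a real loop would poll the GPU once per interval.
    with open("telemetry.jsonl", "a", encoding="utf-8") as handle:
        log_sample(handle, {"temp_c": 41, "power_w": 116, "pstate": "P2"})
```

&lt;p&gt;Syncing on every line costs throughput, but for a low-rate telemetry sample that trade is usually acceptable when the machine may vanish mid-write.&lt;/p&gt;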


&lt;h2&gt;
  
  
  TL;DR — purely technical
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; NVIDIA RTX 3090, 24 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS / kernel:&lt;/strong&gt; Ubuntu 24.04, HWE kernel 6.17.0-35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver:&lt;/strong&gt; NVIDIA &lt;code&gt;595.71.05&lt;/code&gt; (proprietary). Last known-good: &lt;strong&gt;590&lt;/strong&gt; — but it's only built for kernel &lt;code&gt;6.17.0-22&lt;/code&gt; and the branch is retired, so reverting means pinning an old kernel permanently (no security updates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference:&lt;/strong&gt; &lt;code&gt;llama.cpp&lt;/code&gt; (&lt;code&gt;llama-server&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; Qwen3-Embedding-4B (Q8_0, ~4 GB) on its own server, co-loaded with a 26B chat model. The two together left no headroom in 24 GB → the embed model ran at &lt;code&gt;--n-gpu-layers 20&lt;/code&gt; (&lt;strong&gt;partial offload&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Board / PSU:&lt;/strong&gt; ASUS Z390, NZXT C1200. (A prior Corsair RM850x genuinely tripped OCP on 3090 transients — a &lt;em&gt;separate, earlier&lt;/em&gt; fault already fixed by that PSU swap; not this bug.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant, full-system hard reset (like a power cut) the moment a real inference request runs — even a single request to the 4 GB embed model alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero kernel log:&lt;/strong&gt; no &lt;code&gt;Xid&lt;/code&gt;, &lt;code&gt;NVRM&lt;/code&gt;, panic, MCE, or PCIe AER. A verified-working &lt;code&gt;netconsole&lt;/code&gt; + fsync'd on-disk telemetry caught nothing at the crash ⇒ hardware-level reset; the CPU never reaches a &lt;code&gt;printk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Steady state right before death: ~41 °C, P2, ~116 W. Deterministic, not intermittent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagnostic — falsified, each with direct evidence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GSP firmware&lt;/strong&gt; off (&lt;code&gt;NVreg_EnableGpuFirmware=0&lt;/code&gt;) → still crashed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BAR1 / VA-exhaustion&lt;/strong&gt; (&lt;code&gt;open-gpu-kernel-modules&lt;/code&gt; #1134) → would emit &lt;code&gt;Xid 31/154&lt;/code&gt;; none ever captured. &lt;code&gt;pci=realloc=off&lt;/code&gt; couldn't shrink BAR1 (kernel forces 32 GB at enumeration).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thermal&lt;/strong&gt; → crashed at 41–42 °C.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power magnitude&lt;/strong&gt; → 100 W cap (firmware floor) still crashed; yet a cuBLAS SGEMM burn at &lt;strong&gt;284 W ran 6+ min stable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt; → 24 GB+ oversubscription survived 30+ min.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BIOS (CMOS-reset suspicion)&lt;/strong&gt; → full lever set applied &amp;amp; confirmed live (PEG link → Gen3, ReBAR off → BAR1 256 MiB, ASPM off, PCIe clock gating off, C-states clamped) → still crashed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Card health (read-only):&lt;/strong&gt; 0 PCIe replays, 0 AER, full Gen3 x16 under load, no &lt;code&gt;Xid&lt;/code&gt;, &lt;code&gt;Channel Repair Pending: No&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation (raw CUDA, pure cuBLAS — no llama/ggml):&lt;/strong&gt; bursty SGEMM (277 W), cold-burst di/dt (122→278 W), PCIe saturation (12.7 GB/s) → &lt;strong&gt;all survived&lt;/strong&gt;. llama embed &lt;strong&gt;full offload &lt;code&gt;-ngl 99&lt;/code&gt;&lt;/strong&gt; (286 W, 60 req) → &lt;strong&gt;survived&lt;/strong&gt;. llama embed &lt;strong&gt;partial offload &lt;code&gt;-ngl 20&lt;/code&gt;&lt;/strong&gt; → &lt;strong&gt;instant reset on one request&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger is the &lt;strong&gt;partial-offload inference waveform&lt;/strong&gt; — fine per-layer compute micro-bursts interleaved with PCIe-synced stalls — &lt;strong&gt;not&lt;/strong&gt; power magnitude, di/dt, PCIe bandwidth, or compute. Root cause not definitively isolated: a marginal power-delivery transient vs. a 595-series power-management bug on rapid Ampere P-state transitions. Re-testing on driver 590 would discriminate (not yet done).&lt;/li&gt;
&lt;/ul&gt;
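
&lt;p&gt;The "not power magnitude" part of that conclusion can be checked with simple arithmetic over the figures already reported in this summary. The sketch below only restates those numbers in a hypothetical &lt;code&gt;runs&lt;/code&gt; table (the labels and structure are illustrative, not part of the original tooling; the two crashing entries are the 100 W cap and the ~116 W steady state seen before a reset) and measures the gap between the highest crashing draw and the lowest surviving one.&lt;/p&gt;

```python
# (label, approximate watts reported in the write-up, survived?)
runs = [
    ("cuBLAS SGEMM burn, 6+ min", 284, True),
    ("bursty SGEMM", 277, True),
    ("cold-burst di/dt, peak", 278, True),
    ("llama embed, -ngl 99", 286, True),
    ("validated workaround, peak", 312, True),
    ("power cap at 100 W", 100, False),
    ("steady state before reset", 116, False),
]

crash_watts = [w for _, w, ok in runs if not ok]
stable_watts = [w for _, w, ok in runs if ok]

# A positive gap means every crash happened at a lower draw than
# every surviving run, so magnitude alone cannot be the trigger.
gap_w = min(stable_watts) - max(crash_watts)

print("highest crashing draw:", max(crash_watts), "W")
print("lowest surviving draw:", min(stable_watts), "W")
print("gap:", gap_w, "W")
```

&lt;p&gt;Every crashing observation sits well below every surviving run, which is consistent with the waveform, not the wattage, being the discriminating variable.&lt;/p&gt;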

&lt;p&gt;&lt;strong&gt;Workaround&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- llama-server --model embed.gguf --n-gpu-layers 20 ...   # partial offload → crashes
&lt;/span&gt;&lt;span class="gi"&gt;+ llama-server --model embed.gguf --n-gpu-layers 99 ...   # full offload → stable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Keep the embed model &lt;strong&gt;fully GPU-resident&lt;/strong&gt;; don't co-load the 26B chat model (it was what forced partial offload on 24 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validated:&lt;/strong&gt; 2,272 concurrent requests, peak 312 W, zero crashes; enabled permanently as a systemd user service.&lt;/li&gt;
&lt;li&gt;Stays on the maintained &lt;strong&gt;595&lt;/strong&gt; driver — no PSU swap, no RMA, no kernel pin.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on dev.to.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>gpu</category>
      <category>ai</category>
      <category>homelab</category>
    </item>
  </channel>
</rss>
