<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pytorch</title>
    <description>The latest articles tagged 'pytorch' on DEV Community.</description>
    <link>https://dev.to/t/pytorch</link>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tag/pytorch"/>
    <language>en</language>
    <item>
      <title>Debugging Score-P with PyTorch DDP: A Field Guide to CUDA Error 802 and Other Surprises</title>
      <dc:creator>Paramita Choudhury</dc:creator>
      <pubDate>Mon, 29 Jun 2026 05:17:28 +0000</pubDate>
      <link>https://dev.to/choupara/debugging-score-p-with-pytorch-ddp-a-field-guide-to-cuda-error-802-and-other-surprises-4ehe</link>
      <guid>https://dev.to/choupara/debugging-score-p-with-pytorch-ddp-a-field-guide-to-cuda-error-802-and-other-surprises-4ehe</guid>
      <description>&lt;p&gt;When I set out to instrument my multi-GPU DNABERT-2 training runs with &lt;strong&gt;Score-P&lt;/strong&gt; to analyse DDP communication overhead, I expected the hard part to be understanding the traces. Instead, the hard part turned out to be getting Score-P to coexist with PyTorch's &lt;code&gt;torchrun&lt;/code&gt;-based DDP launch mechanism at all.&lt;/p&gt;

&lt;p&gt;This post documents every error I hit and exactly how I fixed each one - in the hope that the next person trying to trace a PyTorch DDP workload with Score-P doesn't spend two days rediscovering the same root causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup:&lt;/strong&gt; DNABERT-2 (117M-parameter genomic transformer), PyTorch 2.1.2, Score-P 8.1 with Python bindings, a SLURM cluster with A100-SXM4-40GB GPUs, 1/4/8-GPU configurations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A companion post — &lt;a href="https://choupara.github.io/posts/2026/06/where-time-goes/" rel="noopener noreferrer"&gt;Where does the time really go in multi-GPU training?&lt;/a&gt; — covers what the traces actually revealed once they worked. This post is purely the war stories of getting there.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Background: two things Score-P does that fight PyTorch DDP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The re-exec mechanism.&lt;/strong&gt; When you run &lt;code&gt;python -m scorep train.py&lt;/code&gt;, Score-P does not simply import itself and start tracing. It sets environment variables (including &lt;code&gt;LD_PRELOAD&lt;/code&gt;) to load its C measurement library, then &lt;em&gt;re-executes the entire Python process&lt;/em&gt; from scratch with those variables in place. Your script effectively starts twice: once as the launcher, once as the instrumented process. Anything that happens before the re-exec - including CUDA initialisation - happens in a process that then exits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The two-layer CUDA model.&lt;/strong&gt; CUDA has two separate APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;driver API&lt;/em&gt; layer, which &lt;code&gt;nvidia-smi&lt;/code&gt; queries through NVML (&lt;code&gt;nvmlInit()&lt;/code&gt;, etc.) - what the kernel module exposes.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;runtime API&lt;/em&gt;, used by &lt;code&gt;cudaGetDeviceCount()&lt;/code&gt;, &lt;code&gt;torch.cuda.is_available()&lt;/code&gt;, &lt;code&gt;torch.zeros(1).cuda()&lt;/code&gt; - what PyTorch and Score-P's CUDA adapter both use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a freshly allocated SLURM job, the driver API can respond immediately while the runtime API is still initialising - sometimes for tens of seconds. Score-P's C CUDA adapter, loaded via &lt;code&gt;LD_PRELOAD&lt;/code&gt; at library-load time, probes the runtime API the moment the process starts. If the runtime isn't ready yet, the probe poisons the CUDA context for that process permanently.&lt;/p&gt;

&lt;p&gt;With that context, here are the errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error 1: FP16 ValueError - a hidden CUDA Error 802
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; Training crashed immediately with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: FP16 Mixed precision training with AMP or APEX ('--fp16') can only
be used on CUDA devices.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looked like a config error - I was clearly on a GPU node, &lt;code&gt;nvidia-smi&lt;/code&gt; showed four A100s, yet PyTorch claimed no CUDA device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real cause.&lt;/strong&gt; Score-P's C CUDA adapter was loaded via &lt;code&gt;LD_PRELOAD&lt;/code&gt; at startup, before PyTorch initialised CUDA. The adapter called &lt;code&gt;cudaGetDeviceCount()&lt;/code&gt; while the runtime was still in the &lt;code&gt;cudaErrorSystemNotReady&lt;/code&gt; (Error 802) state. That left the CUDA context permanently broken for the process; the later &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; returned &lt;code&gt;False&lt;/code&gt;, and the FP16 ValueError was just a downstream symptom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix direction.&lt;/strong&gt; The CUDA runtime must be warmed up - forced to fully initialise - &lt;em&gt;before&lt;/em&gt; &lt;code&gt;python -m scorep&lt;/code&gt; sets &lt;code&gt;LD_PRELOAD&lt;/code&gt;. Once Score-P's C library is loaded, it's too late.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error 2: the nvidia-smi check was the wrong layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First attempt.&lt;/strong&gt; A pre-flight check that polled &lt;code&gt;nvidia-smi&lt;/code&gt; until it responded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;attempt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 10&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if &lt;/span&gt;nvidia-smi &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ngpus_visible&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ngpus_expected&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CUDA ready after &lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt; attempt(s)."&lt;/span&gt;
        &lt;span class="nb"&gt;break
    &lt;/span&gt;&lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;3
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it wasn't enough.&lt;/strong&gt; &lt;code&gt;nvidia-smi&lt;/code&gt; uses the &lt;em&gt;driver&lt;/em&gt; API. A successful call only proves the kernel module is responding - it says nothing about whether &lt;code&gt;cudaGetDeviceCount()&lt;/code&gt; would succeed. On a fresh job, &lt;code&gt;nvidia-smi&lt;/code&gt; passes on attempt 1 while the runtime API is still &lt;code&gt;cudaErrorSystemNotReady&lt;/code&gt;. Wrong layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error 3: &lt;code&gt;assert torch.cuda.is_available()&lt;/code&gt; fails immediately
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Second attempt.&lt;/strong&gt; A Python check inside the warmup loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; assert torch.cuda.is_available(); torch.zeros(1).cuda()"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why it failed.&lt;/strong&gt; On a cold node, &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; does not raise; it quietly returns &lt;code&gt;False&lt;/code&gt;. The &lt;code&gt;assert&lt;/code&gt; then fails on the very first attempt, before the runtime has had time to initialise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Drop the &lt;code&gt;assert&lt;/code&gt;. Call &lt;code&gt;torch.zeros(1).cuda()&lt;/code&gt; directly inside &lt;code&gt;try/except&lt;/code&gt; - let the exception be the "not ready yet" signal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;attempt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 30&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    if &lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import torch, sys
try:
    torch.zeros(1).cuda()
except Exception:
    sys.exit(1)
"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CUDA runtime ready after &lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt; attempt(s)."&lt;/span&gt;
        &lt;span class="nb"&gt;break
    &lt;/span&gt;&lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"CUDA runtime not ready (attempt &lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt;/30), sleeping 10s..."&lt;/span&gt;
    &lt;span class="nb"&gt;sleep &lt;/span&gt;10
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This warmup runs &lt;em&gt;without&lt;/em&gt; Score-P active - no &lt;code&gt;LD_PRELOAD&lt;/code&gt;, no CUDA adapter - so it forces the runtime to initialise once. Every later process (including the Score-P-instrumented workers) then finds the runtime already warm.&lt;/p&gt;
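&lt;p&gt;One gap in the shell loop is that it falls through silently if all 30 attempts fail. A minimal Python sketch of the same warmup that fails loudly is below. The names &lt;code&gt;probe_in_fresh_process&lt;/code&gt; and &lt;code&gt;wait_for_cuda&lt;/code&gt; are hypothetical, and the sketch assumes, as the shell version does, that each probe should run in a fresh interpreter without Score-P active.&lt;/p&gt;

```python
import subprocess
import sys
import time

# Same probe as the shell loop: allocate a tensor on the GPU.
PROBE_SOURCE = "import torch; torch.zeros(1).cuda()"


def probe_in_fresh_process():
    """Run the CUDA probe in a new interpreter; True if it exited cleanly."""
    result = subprocess.run([sys.executable, "-c", PROBE_SOURCE],
                            capture_output=True)
    return result.returncode == 0


def wait_for_cuda(probe=probe_in_fresh_process, attempts=30, delay_s=10.0,
                  sleep=time.sleep):
    """Retry probe() until it succeeds; return the attempt number that worked.

    Raises RuntimeError if every attempt fails, so the job script can stop
    before launching the Score-P-instrumented run.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt != attempts:  # no point sleeping after the last attempt
            sleep(delay_s)
    raise RuntimeError(f"CUDA runtime not ready after {attempts} attempts")
```

&lt;p&gt;Called as &lt;code&gt;wait_for_cuda()&lt;/code&gt; before the &lt;code&gt;python -m scorep&lt;/code&gt; line, a node like the one in Error 4 would then abort the job instead of continuing into a run that cannot succeed.&lt;/p&gt;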




&lt;h2&gt;
  
  
  Error 4: a node with a permanently broken CUDA runtime
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; Even with the 30-attempt warmup (5 minutes), all attempts failed on one particular node, while &lt;code&gt;nvidia-smi&lt;/code&gt; passed on attempt 1 every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis.&lt;/strong&gt; That node had a broken CUDA runtime install: the driver was fine, but &lt;code&gt;cudaGetDeviceCount()&lt;/code&gt; never returned. A sysadmin problem, not an application one - and the scheduler kept landing my jobs there because it was first in the queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Exclude the bad node in the SLURM script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --exclude=&amp;lt;broken_node&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, jobs landed on healthy nodes where both checks passed on attempt 1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error 5: the &lt;code&gt;scorep.user&lt;/code&gt; import in DDP worker processes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; After fixing the node, 4- and 8-GPU runs still failed with Error 802 - this time in the worker processes spawned by &lt;code&gt;torchrun&lt;/code&gt;, not the main process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause.&lt;/strong&gt; I had imported &lt;code&gt;scorep.user&lt;/code&gt; at module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scorep.user&lt;/span&gt;  &lt;span class="c1"&gt;# module-level import
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;torchrun&lt;/code&gt; spawns one worker per GPU, each worker re-imports the module, and &lt;code&gt;import scorep.user&lt;/code&gt; triggers Score-P's CUDA adapter init &lt;em&gt;inside each freshly-spawned subprocess&lt;/em&gt; - before PyTorch sets up that worker's CUDA context. The parent-shell warmup does not carry into child processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Lazy import: defer &lt;code&gt;import scorep.user&lt;/code&gt; until the first &lt;code&gt;training_step()&lt;/code&gt;, by which point PyTorch has initialised CUDA for that worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Module level - not imported yet
&lt;/span&gt;&lt;span class="n"&gt;_scorep_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# None = not yet tried; False = import failed
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ScorePTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;training_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_scorep_user&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_scorep_user&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scorep.user&lt;/span&gt;
                &lt;span class="n"&gt;_scorep_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scorep&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;_scorep_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_scorep_user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;_scorep_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;region_begin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnabert_train_step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;training_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_scorep_user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;_scorep_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;region_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dnabert_train_step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;None&lt;/code&gt; guard means the import is attempted exactly once per process, on the first step - after CUDA is ready for that rank.&lt;/p&gt;
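&lt;p&gt;The wrapper above only calls &lt;code&gt;region_end&lt;/code&gt; when the step returns normally. A possible refinement, sketched here with a hypothetical helper named &lt;code&gt;scorep_region&lt;/code&gt;, pairs the two calls with &lt;code&gt;try/finally&lt;/code&gt; so the region is still closed if the step raises. It assumes &lt;code&gt;user_api&lt;/code&gt; is either the lazily imported &lt;code&gt;scorep.user&lt;/code&gt; module or a falsy value when the import failed.&lt;/p&gt;

```python
from contextlib import contextmanager


@contextmanager
def scorep_region(user_api, name):
    """Wrap a block in region_begin/region_end; a no-op if user_api is falsy."""
    if user_api:
        user_api.region_begin(name)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so begin/end stay paired.
        if user_api:
            user_api.region_end(name)
```

&lt;p&gt;Inside &lt;code&gt;training_step&lt;/code&gt; this would read &lt;code&gt;with scorep_region(_scorep_user, "dnabert_train_step"): result = super().training_step(model, inputs)&lt;/code&gt;.&lt;/p&gt;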




&lt;h2&gt;
  
  
  Error 6: Score-P memory limit exceeded
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; With CUDA kernel tracing on (&lt;code&gt;SCOREP_CUDA_ENABLE=kernel,memcpy,sync&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Score-P] Warning: Too many memory requested. Score-P supports only up to,
but not including, 4 GiB of total memory per process. Reducing to its maximum value.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had set &lt;code&gt;SCOREP_TOTAL_MEMORY=4G&lt;/code&gt;. Score-P's hard per-process limit is &lt;em&gt;strictly less than&lt;/em&gt; 4 GiB - exactly 4G hits the ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; &lt;code&gt;SCOREP_TOTAL_MEMORY=3500M&lt;/code&gt; - under the cap, with room for CUDA kernel traces across 8 ranks.&lt;/p&gt;
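&lt;p&gt;The arithmetic behind that choice can be checked in a few lines. This is only a sketch: the helper names are hypothetical, and it assumes the &lt;code&gt;K&lt;/code&gt;/&lt;code&gt;M&lt;/code&gt;/&lt;code&gt;G&lt;/code&gt; suffixes are powers of 1024, which is worth confirming against the Score-P documentation for your version.&lt;/p&gt;

```python
# Assumption: size suffixes are binary (K = 1024, M = 1024**2, G = 1024**3).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# The warning says: up to, but not including, 4 GiB per process.
LIMIT_BYTES = 4 * 1024 ** 3


def to_bytes(size):
    """Convert a string such as '3500M' or '4G' to a byte count."""
    size = size.strip().upper()
    if size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)


def fits_under_limit(size):
    """True if the requested size is strictly below the 4 GiB ceiling."""
    return LIMIT_BYTES > to_bytes(size)
```

&lt;p&gt;Under that assumption, &lt;code&gt;fits_under_limit("4G")&lt;/code&gt; is false because the value equals the ceiling, while &lt;code&gt;fits_under_limit("3500M")&lt;/code&gt; is true.&lt;/p&gt;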




&lt;h2&gt;
  
  
  Error 7: &lt;code&gt;load_best_model_at_end&lt;/code&gt; strategy conflict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; The short 50-step trace runs failed instantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: --load_best_model_at_end requires the save and eval strategy to match,
but found Evaluation strategy: NO / Save strategy: STEPS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd set &lt;code&gt;--evaluation_strategy no&lt;/code&gt; to keep eval passes from distorting the trace timeline, but &lt;code&gt;load_best_model_at_end=True&lt;/code&gt; requires matching save/eval strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; There is no "best model" for a 50-step diagnostic run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--evaluation_strategy&lt;/span&gt; no &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--load_best_model_at_end&lt;/span&gt; False &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--save_steps&lt;/span&gt; 10000 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Error 8: the trace contained the launcher, not the workers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; The runs completed, produced an OTF2 trace, and opened cleanly in Vampir - showing a single red bar: &lt;code&gt;...LocalElasticAgent._invoke_run&lt;/code&gt;. No kernels. No NCCL. No training steps. The process filter listed exactly &lt;strong&gt;one&lt;/strong&gt; process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F57w03g8bitfl1tobo1ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F57w03g8bitfl1tobo1ha.png" alt="Vampir Master Timeline of the launcher-only trace: a single red `_invoke_run` bar spanning the entire run, no GPU kernels, one process in the filter" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause.&lt;/strong&gt; The launcher was &lt;code&gt;python -m scorep .../scorep_torchrun.py&lt;/code&gt;, where &lt;code&gt;scorep_torchrun.py&lt;/code&gt; is just &lt;code&gt;from torch.distributed.run import main; main()&lt;/code&gt; - i.e. plain &lt;code&gt;torchrun&lt;/code&gt;. Its elastic agent &lt;strong&gt;spawns the GPU workers as separate child processes&lt;/strong&gt; that start fresh &lt;code&gt;python&lt;/code&gt; interpreters with no Score-P. Score-P therefore instrumented only the agent - the babysitter - which spends the whole run waiting. On disk the proof was unambiguous: the entire 8-GPU trace held a single process's events, and &lt;code&gt;scorep-score&lt;/code&gt; showed only Python (&lt;code&gt;USR&lt;/code&gt;) regions - no &lt;code&gt;CUDA&lt;/code&gt; type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;traces/
&lt;span class="gp"&gt;0.def   0.evt                          #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;one process, not eight
&lt;span class="go"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;scorep-score profile.cubex
&lt;span class="go"&gt;flt  type  max_buf[B]   visits  time[s] time[%]  region
     ALL  16,266,573  625,628   38.47   100.0   ALL
     USR  16,266,302  625,627   38.04    98.9   USR     ← all Python, the agent waiting
  SCOREP        271        1    0.43     1.1   SCOREP
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Stop letting &lt;code&gt;torchrun&lt;/code&gt; spawn. Launch each rank yourself in a background loop, each as its own &lt;code&gt;python -m scorep&lt;/code&gt; process with its own &lt;code&gt;SCOREP_EXPERIMENT_DIRECTORY=scorep_rank_N&lt;/code&gt;, using a single-node rendezvous (&lt;code&gt;MASTER_ADDR=localhost&lt;/code&gt;, per-rank &lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;LOCAL_RANK&lt;/code&gt;). This is the core of what &lt;code&gt;torchrun&lt;/code&gt; does internally - start N ranks, hand each its identity - except now every rank runs under Score-P. No &lt;code&gt;srun&lt;/code&gt;, no spawn.&lt;/p&gt;
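&lt;p&gt;A rough Python sketch of such a per-rank launcher follows. It is not the author's script: &lt;code&gt;rank_env&lt;/code&gt;, &lt;code&gt;rank_command&lt;/code&gt; and &lt;code&gt;launch_all&lt;/code&gt; are hypothetical names, and &lt;code&gt;MASTER_PORT&lt;/code&gt; (with the assumed value 29500) and &lt;code&gt;WORLD_SIZE&lt;/code&gt; are added here because PyTorch's environment-variable rendezvous normally expects them alongside the variables named above.&lt;/p&gt;

```python
import os
import subprocess
import sys


def rank_env(rank, world_size, base_env=None, master_port="29500"):
    """Build the environment for one rank of a single-node DDP job."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": master_port,      # assumed free port
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(rank),         # single node: local rank == global rank
        "SCOREP_EXPERIMENT_DIRECTORY": f"scorep_rank_{rank}",
    })
    return env


def rank_command(script, script_args=(), scorep_flags=()):
    """Command line that runs one rank under the Score-P Python wrapper."""
    return [sys.executable, "-m", "scorep", *scorep_flags, script, *script_args]


def launch_all(world_size, script, script_args=(), scorep_flags=()):
    """Start every rank in the background, then wait for all of them."""
    procs = [
        subprocess.Popen(rank_command(script, script_args, scorep_flags),
                         env=rank_env(rank, world_size))
        for rank in range(world_size)
    ]
    return [proc.wait() for proc in procs]
```

&lt;p&gt;Giving each rank its own experiment directory keeps the per-rank measurements from overwriting one another, at the cost of having to inspect N separate result directories afterwards.&lt;/p&gt;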




&lt;h2&gt;
  
  
  Error 9: &lt;code&gt;SCOREP_CUDA_ENABLE&lt;/code&gt; captured zero kernels
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; With per-rank launch working, the traces contained the training process - but &lt;strong&gt;zero CUDA kernels&lt;/strong&gt;. &lt;code&gt;scorep-score&lt;/code&gt; showed 98% &lt;code&gt;USR&lt;/code&gt; (Python) regions and no &lt;code&gt;CUDA&lt;/code&gt; type at all, despite &lt;code&gt;SCOREP_CUDA_ENABLE=kernel,memcpy,sync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause.&lt;/strong&gt; The Score-P Python wrapper passes unknown flags to &lt;code&gt;scorep-config&lt;/code&gt;, whose help is explicit: &lt;code&gt;--cuda|--nocuda … On default cuda instrumentation is disabled.&lt;/code&gt; Setting &lt;code&gt;SCOREP_CUDA_ENABLE&lt;/code&gt; only configures &lt;em&gt;what&lt;/em&gt; the CUDA adapter records - but without &lt;code&gt;--cuda&lt;/code&gt; the adapter is never loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Add &lt;code&gt;--cuda&lt;/code&gt; to the launch: &lt;code&gt;python -m scorep --cuda --thread=pthread&lt;/code&gt;. A wrinkle: do &lt;strong&gt;not&lt;/strong&gt; add the documented &lt;code&gt;--&lt;/code&gt; script separator - this wrapper version forwards it to &lt;code&gt;scorep-config&lt;/code&gt;, which rejects it (&lt;code&gt;Unknown option: '--'&lt;/code&gt;). After the fix, a &lt;code&gt;CUDA&lt;/code&gt; type appears in &lt;code&gt;scorep-score&lt;/code&gt;, with ~106 named GPU kernel regions per rank - and, crucially, the NCCL collectives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;scorep-score profile.cubex
&lt;span class="go"&gt;flt  type  max_buf[B]     visits  time[s] time[%]  region
     ALL  73,352,737  3,001,818   41.63   100.0   ALL
     USR  73,351,928  2,821,228   36.83    88.5   USR
    CUDA   2,347,618     90,294    3.99     9.6   CUDA   ← GPU kernels now captured

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;scorep-score &lt;span class="nt"&gt;-r&lt;/span&gt; profile.cubex | &lt;span class="nb"&gt;grep &lt;/span&gt;CUDA | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-k4&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;span class="go"&gt;  CUDA    700 visits  2.38s  ncclKernel_AllReduce_RING_LL_Sum_float   ← gradient sync
  CUDA     62 visits  0.09s  ncclKernel_AllGather_RING_LL_Sum_int8_t
&lt;/span&gt;&lt;span class="gp"&gt;  CUDA  9,538 visits  0.09s  at::native::unrolled_elementwise_kernel&amp;lt;...&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;  CUDA  8,376 visits  0.08s  at::native::elementwise_kernel&amp;lt;128, 4, ...&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Error 10: CUPTI buffer overflow at 8 ranks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; The 1- and 4-GPU traces were clean, but the 8-GPU run dropped records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CUPTI Activity] Dropped 85222 records. Current buffer size: 1048576 bytes
Proposed minimum SCOREP_CUDA_BUFFER=8889000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cause.&lt;/strong&gt; Eight ranks each profiling through CUPTI overran the default 1 MiB (1048576-byte) per-process CUDA activity buffer between flushes, so the overflowing kernel records were discarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; The warning itself proposes a minimum buffer of 8889000 bytes (about 8.5 MiB). I set &lt;code&gt;SCOREP_CUDA_BUFFER=64M&lt;/code&gt; for generous headroom, and no more records were dropped.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error 11: the NCCL watchdog vs. Score-P shutdown race
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom.&lt;/strong&gt; The 8-GPU run finished training but then &lt;strong&gt;6 of 8 ranks aborted&lt;/strong&gt; during teardown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terminate called after throwing an instance of 'c10::Error'
  what():  Should never been called   (dummyHasPrimaryContext)
  ... c10d::ProcessGroupNCCL::ncclCommWatchdog()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The aborts struck &lt;em&gt;after&lt;/em&gt; training, killing the process before Score-P flushed its profile - so those ranks left no trace on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause.&lt;/strong&gt; A shutdown &lt;strong&gt;race&lt;/strong&gt;: PyTorch's background NCCL watchdog thread runs its cleanup destructor (which touches the CUDA device) at interpreter exit, at the same time Score-P tears down its CUDA context. Whichever loses, crashes. It's non-deterministic - a later 4-GPU run lost the race where an 8-GPU run had won it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix.&lt;/strong&gt; Remove the race instead of fighting it: call &lt;code&gt;torch.distributed.destroy_process_group()&lt;/code&gt; at the end of &lt;code&gt;train()&lt;/code&gt;, so NCCL is torn down &lt;em&gt;cleanly, before&lt;/em&gt; the interpreter (and Score-P) begin shutdown. With the process group gone, there's no watchdog destructor left to collide with Score-P's teardown, and all eight ranks flush reliably.&lt;/p&gt;
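&lt;p&gt;One way to make that teardown hard to forget is to put it in a &lt;code&gt;finally&lt;/code&gt; block. The sketch below uses a hypothetical helper, &lt;code&gt;train_then_teardown&lt;/code&gt;, and takes the distributed module as a parameter (in real use, &lt;code&gt;torch.distributed&lt;/code&gt;) so the logic can be read and exercised without a GPU.&lt;/p&gt;

```python
def train_then_teardown(train_fn, dist):
    """Run train_fn(), then destroy the process group even if training raised.

    dist is expected to behave like torch.distributed: it must provide
    is_available(), is_initialized() and destroy_process_group().
    """
    try:
        return train_fn()
    finally:
        # Tear NCCL down explicitly, before interpreter shutdown begins.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
```

&lt;p&gt;In the training script this would be called as &lt;code&gt;train_then_teardown(train, torch.distributed)&lt;/code&gt;; the &lt;code&gt;is_initialized()&lt;/code&gt; check keeps the single-GPU configuration, where no process group exists, working unchanged.&lt;/p&gt;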




&lt;h2&gt;
  
  
  Final state: per-rank GPU traces collected
&lt;/h2&gt;

&lt;p&gt;After all eleven fixes, every DDP rank ran under its own Score-P measurement, capturing each worker's GPU kernels &lt;em&gt;and&lt;/em&gt; NCCL communication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fncvy88rtsd7wtxamvxr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fncvy88rtsd7wtxamvxr2.png" alt="Vampir Function Summary for an 8-GPU rank: `ncclKernel_AllReduce_RING_LL_Sum_float` at 2.375 s, sitting right beside `torch.autograd:backward` at 2.22 s - gradient synchronisation costs as much GPU time as the entire backward pass" width="780" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Samples/sec&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 GPU&lt;/td&gt;
&lt;td&gt;478.8 s&lt;/td&gt;
&lt;td&gt;75.0&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 GPU&lt;/td&gt;
&lt;td&gt;103.8 s&lt;/td&gt;
&lt;td&gt;345.9&lt;/td&gt;
&lt;td&gt;4.61×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 GPU&lt;/td&gt;
&lt;td&gt;52.8 s&lt;/td&gt;
&lt;td&gt;680.7&lt;/td&gt;
&lt;td&gt;9.08×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
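&lt;p&gt;The Speedup column above matches the ratio of the Samples/sec values, which a few lines of Python can reproduce along with the per-GPU scaling efficiency. The helper name &lt;code&gt;scaling&lt;/code&gt; is hypothetical; the numbers are copied from the table.&lt;/p&gt;

```python
# GPUs -> samples/sec, copied from the table above.
THROUGHPUT = {1: 75.0, 4: 345.9, 8: 680.7}


def scaling(throughput):
    """Return {gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    base = throughput[1]
    rows = {}
    for gpus, rate in sorted(throughput.items()):
        speedup = rate / base
        # Efficiency is speedup divided by GPU count; 1.0 means linear scaling.
        rows[gpus] = (round(speedup, 2), round(speedup / gpus, 2))
    return rows
```

&lt;p&gt;An efficiency above 1.0 for the 4- and 8-GPU rows means these particular runs scaled slightly better than linearly relative to the single-GPU baseline.&lt;/p&gt;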

&lt;p&gt;On 1 GPU there is zero NCCL; from 4 GPUs the gradient AllReduce appears, and by 8 GPUs &lt;code&gt;ncclKernel_AllReduce&lt;/code&gt; is the single largest GPU activity (~2.375 s, comparable to the entire backward pass) - yet it overlaps backward compute on a separate CUDA stream, which is why throughput still scales linearly or slightly better (9.08× on 8 GPUs). The Master Timeline makes the overlap visible: compute runs on the default stream &lt;code&gt;CUDA[0:7]&lt;/code&gt; (&lt;code&gt;CUDA_NULL_STREAM&lt;/code&gt;) while &lt;code&gt;ncclKernel_AllReduce&lt;/code&gt; runs &lt;em&gt;concurrently&lt;/em&gt; on a separate stream &lt;code&gt;CUDA[0:20]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1neg9mq4v9z3tobv7coh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1neg9mq4v9z3tobv7coh.png" alt="Vampir Master Timeline zoomed to a few training steps: dense compute kernels on the default stream `CUDA[0:7]` run at the same time as `ncclKernel_AllReduce` blocks on stream `CUDA[0:20]` - communication overlapped with backward compute, so most of its cost is hidden from wall-clock" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What that decomposition &lt;em&gt;means&lt;/em&gt; is the subject of the companion post.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Score-P's re-exec is not optional - design around it.&lt;/strong&gt; Any CUDA init that must happen before Score-P's C adapter loads has to happen before the &lt;code&gt;python -m scorep&lt;/code&gt; call in your shell script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The driver and runtime APIs are different things.&lt;/strong&gt; &lt;code&gt;nvidia-smi&lt;/code&gt; passing is necessary but not sufficient. Test the runtime directly (&lt;code&gt;torch.zeros(1).cuda()&lt;/code&gt;), and use &lt;code&gt;try/except&lt;/code&gt;, not &lt;code&gt;is_available()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Module-level imports of the Score-P user API break DDP workers.&lt;/strong&gt; Each rank is a fresh subprocess; lazy-import inside the first method PyTorch guarantees runs after CUDA setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A broken node will burn your budget on CUDA timeouts.&lt;/strong&gt; &lt;code&gt;--exclude&lt;/code&gt; it as soon as you spot it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate the diagnostic trace from the full run.&lt;/strong&gt; CUDA-enabled &lt;code&gt;--max_steps 50&lt;/code&gt; with &lt;code&gt;--evaluation_strategy no&lt;/code&gt; gives clean, size-controlled traces; the full run with CUDA off gives robust scaling stats. You need both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To trace DDP workers, launch them yourself - don't let &lt;code&gt;torchrun&lt;/code&gt; spawn.&lt;/strong&gt; Replace it with a background loop where each rank is its own &lt;code&gt;python -m scorep&lt;/code&gt; process. This is the single most important structural fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setting an env var is not the same as loading the adapter.&lt;/strong&gt; &lt;code&gt;SCOREP_CUDA_ENABLE&lt;/code&gt; configures the CUDA adapter; &lt;code&gt;--cuda&lt;/code&gt; &lt;em&gt;loads&lt;/em&gt; it. Confirm a &lt;code&gt;CUDA&lt;/code&gt; type appears in &lt;code&gt;scorep-score&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile sums overstate communication - use the timeline for wall-clock truth.&lt;/strong&gt; NCCL LL kernels busy-wait, and DDP overlaps AllReduce with backward compute on a separate stream, so summed kernel-time double-counts the overlap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tear down NCCL cleanly so it doesn't race your profiler at exit.&lt;/strong&gt; &lt;code&gt;torch.distributed.destroy_process_group()&lt;/code&gt; at the end of training removes the watchdog before interpreter shutdown.&lt;/li&gt;
&lt;/ol&gt;
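&lt;p&gt;Lesson 2 can be turned into a small pre-flight check. The following is a minimal sketch, not the script used for the runs in this post: the function name &lt;code&gt;cuda_runtime_ok&lt;/code&gt; is hypothetical, and the only PyTorch call it relies on is the &lt;code&gt;torch.zeros(1).cuda()&lt;/code&gt; allocation mentioned above, wrapped in &lt;code&gt;try/except&lt;/code&gt;.&lt;/p&gt;

```python
def cuda_runtime_ok():
    """Return (ok, detail) after forcing a real CUDA runtime allocation.

    Hypothetical helper for illustration; not part of PyTorch or Score-P.
    """
    try:
        # Imported lazily so a missing install is reported instead of crashing.
        import torch
    except ImportError as exc:
        return False, f"torch not importable: {exc}"
    try:
        # Allocating on the device forces the CUDA runtime to initialise,
        # which is the step that a passing nvidia-smi does not exercise.
        torch.zeros(1).cuda()
    except Exception as exc:  # RuntimeError, AssertionError on CPU-only builds, ...
        return False, f"CUDA runtime check failed: {exc}"
    return True, "CUDA runtime initialised"


ok, detail = cuda_runtime_ok()
print(ok, detail)
```

&lt;p&gt;Running a check like this at the top of the job script, before the &lt;code&gt;python -m scorep&lt;/code&gt; call, should surface a broken node in seconds rather than after a CUDA timeout.&lt;/p&gt;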




&lt;p&gt;&lt;em&gt;The OTF2 traces behind this post were generated on a SLURM cluster (A100-SXM4-40GB) as part of a Score-P performance analysis of DDP training scaling for the DNABERT-2 genomic classifier.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hpc</category>
      <category>gpu</category>
      <category>pytorch</category>
      <category>performance</category>
    </item>
    <item>
      <title>Classifier-free guidance above 7.5 oversaturated our product renders</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 26 Jun 2026 05:36:29 +0000</pubDate>
      <link>https://dev.to/elise_moreau/classifier-free-guidance-above-75-oversaturated-our-product-renders-10aj</link>
      <guid>https://dev.to/elise_moreau/classifier-free-guidance-above-75-oversaturated-our-product-renders-10aj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Classifier-free guidance above a scale of ~7.5 pushed our SDXL product renders into oversaturation and clipped highlights. Adding CFG rescale at 0.7 plus dynamic thresholding fixed it with no retraining.&lt;/p&gt;

&lt;p&gt;Around 18% of our automated product renders at Photoroom came back with blown-out highlights and oversaturated color once we raised the classifier-free guidance scale from 5.0 to 9.0 on our fine-tuned SDXL pipeline. The higher scale gave us sharper adherence to the prompt, which the catalog team wanted, but white backgrounds shifted toward grey-blue and metallic surfaces lost their specular detail. To be precise, the problem was not the prompt and not the fine-tune. It was the guidance arithmetic itself interacting with the noise schedule, and it is well documented if you know where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  What classifier-free guidance actually does
&lt;/h2&gt;

&lt;p&gt;Classifier-free guidance combines two model predictions at each denoising step: one conditioned on the prompt and one unconditional. The sampler extrapolates along the vector between them, scaled by a guidance weight. A weight of 1.0 reduces to the plain conditional prediction, which is to say no guidance, and weights of 5 to 9 are typical for SDXL. Higher weights increase prompt adherence at the cost of pushing latents outside the distribution the model was trained on.&lt;/p&gt;

&lt;p&gt;The method comes from Ho and Salimans in &lt;a href="https://arxiv.org/abs/2207.12598" rel="noopener noreferrer"&gt;Classifier-Free Diffusion Guidance&lt;/a&gt;. The formula at each step is straightforward: take the unconditional prediction, add the guidance scale times the difference between conditional and unconditional. The nuance here is that this extrapolation has no bound. As you raise the scale, the standard deviation of the guided prediction grows past the statistics the model learned, and that excess energy shows up in the decoded image as clipping.&lt;/p&gt;
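&lt;p&gt;To make the unbounded extrapolation concrete, here is a small sketch in plain Python rather than tensors. The four-element "predictions" are made up purely for illustration, and &lt;code&gt;guide&lt;/code&gt; is a hypothetical helper name; only the formula itself comes from the description above.&lt;/p&gt;

```python
from statistics import pstdev


def guide(uncond, cond, scale):
    """Classifier-free guidance on plain lists: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up toy values standing in for the two noise predictions.
uncond = [0.0, 0.2, -0.1, 0.1]
cond = [0.5, -0.3, 0.2, -0.4]

for scale in (1.0, 5.0, 9.0):
    guided = guide(uncond, cond, scale)
    # The spread of the guided prediction grows as the scale is raised.
    print(scale, round(pstdev(guided), 3))
```

&lt;p&gt;At a scale of 1.0 the result is just the conditional prediction; at larger scales the standard deviation of these toy values keeps growing, which is the effect the rest of this post is about.&lt;/p&gt;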

&lt;h2&gt;
  
  
  Why high guidance scales oversaturate
&lt;/h2&gt;

&lt;p&gt;The decoded pixel range is fixed: the VAE decoder produces values in roughly [-1, 1] that are then mapped to RGB. When guidance inflates the variance of the predicted noise, the resulting latents carry larger magnitudes than the VAE was trained to reconstruct cleanly. Bright regions saturate to pure white, and color channels drift because the per-channel means shift together. We measured this directly: at guidance 9.0 the per-image latent standard deviation was about 1.4x the standard deviation of the conditional prediction alone.&lt;/p&gt;

&lt;p&gt;This is the same failure mode the Imagen team described in &lt;a href="https://arxiv.org/abs/2205.11487" rel="noopener noreferrer"&gt;Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding&lt;/a&gt;, where high guidance weights produced saturated, unnatural images. Their answer was dynamic thresholding. A second, complementary fix came later from Lin and colleagues in &lt;a href="https://arxiv.org/abs/2305.08891" rel="noopener noreferrer"&gt;Common Diffusion Noise Schedules and Sample Steps are Flawed&lt;/a&gt;, which introduced guidance rescale to bring the guided prediction's variance back in line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two fixes that stack: CFG rescale and dynamic thresholding
&lt;/h2&gt;

&lt;p&gt;CFG rescale corrects the standard deviation of the guided prediction toward the conditional prediction, then blends between the corrected and raw versions by a factor. We set that factor to 0.7 after a sweep. Here is the core of what we run inside the sampler loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_cfg_rescale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;noise_cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# standard classifier-free guidance
&lt;/span&gt;    &lt;span class="n"&gt;noise_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;guidance_scale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;noise_cond&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;noise_uncond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# rescale variance back toward the conditional prediction (Lin et al. 2023)
&lt;/span&gt;    &lt;span class="n"&gt;std_cond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cond&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;std_cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;noise_rescaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_cond&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;std_cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# blend corrected and raw so detail is not fully flattened
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_rescaled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;guidance_rescale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;noise_cfg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dynamic thresholding works at a different layer. At each step it predicts the clean sample, computes a high percentile of the absolute pixel values (we use the 99.5th), and clamps to that value before renormalizing. The two corrections address different symptoms. Rescale fixes the variance inflation; thresholding clamps the residual outliers that survive. Running both at guidance 9.0 brought our oversaturation rate from 18% to under 2% on a held-out set of 4,000 SKUs.&lt;/p&gt;
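&lt;p&gt;The thresholding step can be sketched on a flat list of values. This is a simplified illustration, not our production code: it assumes a nearest-rank percentile, floors the threshold at 1.0 as the Imagen paper describes, and reads "renormalizing" as dividing by the threshold. The name &lt;code&gt;dynamic_threshold&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import math


def dynamic_threshold(x0, percentile=0.995):
    """Clamp a predicted clean sample to its own high-percentile magnitude.

    x0 is a flat list of predicted clean-sample values (assumed layout).
    """
    magnitudes = sorted(abs(v) for v in x0)
    # Nearest-rank percentile of the absolute values.
    rank = max(1, math.ceil(percentile * len(magnitudes)))
    # Never shrink the range below [-1, 1]; only large outliers trigger scaling.
    s = max(magnitudes[rank - 1], 1.0)
    # Clamp to [-s, s], then divide by s so the result sits in [-1, 1].
    return [max(-s, min(s, v)) / s for v in x0]


print(dynamic_threshold([0.1, -0.5, 0.9, 3.0], percentile=0.75))
```

&lt;p&gt;In a real sampler the same logic would run per image on tensors, with the percentile chosen per product category as discussed in the trade-offs section.&lt;/p&gt;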

&lt;h2&gt;
  
  
  How we chose the rescale factor
&lt;/h2&gt;

&lt;p&gt;We swept the rescale factor across 0.0, 0.3, 0.5, 0.7, and 1.0 and scored each batch on two axes. The first was a saturation metric: the fraction of pixels with channel values above 0.97 after decoding. The second was CLIP image-text similarity, so we did not trade away the prompt adherence we raised guidance to get. A factor of 1.0 fully matched the conditional variance but flattened contrast on glossy products. A factor of 0.0 left the original problem. The factor of 0.7 held CLIP similarity within 0.4% of the unrescaled run while cutting the saturated-pixel fraction by more than half.&lt;/p&gt;
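&lt;p&gt;For reference, the saturation metric is simple enough to sketch in a few lines. This version makes one assumption the text above leaves open: a pixel counts as saturated when any of its channels exceeds the threshold. The name &lt;code&gt;saturated_fraction&lt;/code&gt; and the tuple-based pixel layout are illustrative only.&lt;/p&gt;

```python
def saturated_fraction(pixels, threshold=0.97):
    """Fraction of pixels with at least one channel value above threshold.

    pixels is an iterable of (r, g, b) tuples scaled to [0, 1] (assumed layout).
    """
    pixels = list(pixels)
    if not pixels:
        return 0.0
    saturated = sum(1 for px in pixels if max(px) > threshold)
    return saturated / len(pixels)


sample = [(1.0, 1.0, 1.0), (0.5, 0.5, 0.5), (0.98, 0.1, 0.1), (0.2, 0.3, 0.4)]
print(saturated_fraction(sample))  # 0.5: two of the four pixels exceed 0.97
```

&lt;p&gt;Scoring each sweep setting with a metric like this next to CLIP similarity keeps the two axes independent, so a drop in saturation cannot hide a drop in prompt adherence.&lt;/p&gt;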

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;CFG rescale adds two standard deviation reductions and an elementwise blend per step. On our pipeline that is well under 1% of step latency, so cost is not the concern. The real trade-off is contrast. At rescale factors above 0.8 we saw glossy and metallic products lose specular punch, which matters for jewelry and electronics catalogs. Dynamic thresholding has its own edge case: on images that are genuinely meant to be bright and high-key, an aggressive percentile clamps legitimate highlights, so we tuned the percentile per product category rather than globally.&lt;/p&gt;

&lt;p&gt;There is also a simpler path we rejected. You can lower the guidance scale back to 5.0 and avoid the whole question, but you lose the prompt fidelity the catalog team asked for. The corrections let us keep a scale of 8.0 to 9.0 without the artifacts, which was the actual goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;If your renders saturate at high classifier-free guidance, measure the per-image latent standard deviation against the conditional-only prediction before reaching for retraining. The fix is almost always in the guidance arithmetic, not the weights. I would start with CFG rescale at 0.7, add dynamic thresholding only if outliers remain, and validate with a saturated-pixel metric alongside CLIP similarity so you do not silently trade away adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2207.12598" rel="noopener noreferrer"&gt;Classifier-Free Diffusion Guidance, Ho and Salimans, 2022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.08891" rel="noopener noreferrer"&gt;Common Diffusion Noise Schedules and Sample Steps are Flawed, Lin et al., 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2205.11487" rel="noopener noreferrer"&gt;Photorealistic Text-to-Image Diffusion Models, Saharia et al., 2022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl" rel="noopener noreferrer"&gt;Diffusers guidance_rescale documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>ComfyUI 'Torch not compiled with CUDA enabled'? Every Fix That Works on Windows, Linux, and Mac (2026)</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Wed, 24 Jun 2026 07:06:37 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/comfyui-torch-not-compiled-with-cuda-enabled-every-fix-that-works-on-windows-linux-and-mac-556m</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/comfyui-torch-not-compiled-with-cuda-enabled-every-fix-that-works-on-windows-linux-and-mac-556m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/comfyui-torch-not-compiled-with-cuda-enabled-fix-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: This error means the PyTorch you have installed is the CPU-only build — it literally has no CUDA code compiled in, so it can't see your GPU even though the driver is fine. The fix is never to reinstall CUDA or your GPU driver; it's to uninstall the CPU &lt;code&gt;torch&lt;/code&gt; and reinstall the matching &lt;code&gt;cu12x&lt;/code&gt; wheel from PyTorch's own index. On an RTX 50-series card you need the &lt;code&gt;cu128&lt;/code&gt; build specifically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll be able to do after this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm in 10 seconds whether your &lt;code&gt;torch&lt;/code&gt; is the CPU build or the GPU build&lt;/li&gt;
&lt;li&gt;Reinstall the correct CUDA wheel in both ComfyUI portable and a manual venv install&lt;/li&gt;
&lt;li&gt;Pick the right &lt;code&gt;cu124&lt;/code&gt; / &lt;code&gt;cu126&lt;/code&gt; / &lt;code&gt;cu128&lt;/code&gt; wheel for your exact GPU (and know why RTX 50-series is different)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: 90% of the time this happens because a custom node ran &lt;code&gt;pip install&lt;/code&gt; something, pip pulled &lt;code&gt;torch&lt;/code&gt; as a dependency, and on Windows the default PyPI &lt;code&gt;torch&lt;/code&gt; wheel is CPU-only. You didn't break CUDA — pip quietly swapped your good GPU build for a smaller CPU one. Reinstalling the right wheel takes about three minutes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What the error actually means
&lt;/h2&gt;

&lt;p&gt;When ComfyUI starts (or the first time it tries to move a model to the GPU) you get a traceback ending in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;orch\cuda\__init__.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;310&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_lazy_init&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AssertionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Torch not compiled with CUDA enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;AssertionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Torch&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;compiled&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;CUDA&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read it literally: the &lt;em&gt;PyTorch binary&lt;/em&gt; you installed was built without CUDA support. PyTorch ships in separate flavors — a CPU-only wheel and several CUDA wheels (&lt;code&gt;cu124&lt;/code&gt;, &lt;code&gt;cu126&lt;/code&gt;, &lt;code&gt;cu128&lt;/code&gt;, etc.). The CPU wheel is a completely different binary with no GPU kernels in it. No driver update, no CUDA Toolkit install, and no environment variable will add CUDA to a CPU wheel. You have to replace the wheel.&lt;/p&gt;

&lt;p&gt;This is different from a &lt;em&gt;driver&lt;/em&gt; problem. If your NVIDIA driver were missing, &lt;code&gt;nvidia-smi&lt;/code&gt; would fail. Run it in a terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-smi
&lt;span class="go"&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 581.xx       Driver Version: 581.xx       CUDA Version: 12.8     |
|   0  NVIDIA GeForce RTX 4070 ...                                            |
+-----------------------------------------------------------------------------+
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; shows your card, your driver is fine and the problem is 100% on the PyTorch side. (The "CUDA Version: 12.8" line here is the &lt;em&gt;maximum&lt;/em&gt; CUDA the driver supports, not the version PyTorch needs — a common point of confusion.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Confirm you actually have the CPU build
&lt;/h2&gt;

&lt;p&gt;Before changing anything, prove the diagnosis. ComfyUI portable ships its own Python under &lt;code&gt;python_embeded&lt;/code&gt;, so use that exact interpreter — not whatever &lt;code&gt;python&lt;/code&gt; resolves to in your PATH. From the &lt;code&gt;ComfyUI_windows_portable&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;.\python_embeded\python.exe &lt;span class="na"&gt;-c &lt;/span&gt;&lt;span class="s2"&gt;"import torch; print(torch.__version__); print(torch.cuda.is_available())"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A CPU build prints something like this — note the &lt;code&gt;+cpu&lt;/code&gt; suffix and &lt;code&gt;False&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.8.0+cpu
False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A working GPU build prints a CUDA tag (&lt;code&gt;+cu128&lt;/code&gt;) and &lt;code&gt;True&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.8.0+cu128
True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see &lt;code&gt;+cpu&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;, this guide fixes you. If you see &lt;code&gt;+cu128&lt;/code&gt; and &lt;code&gt;True&lt;/code&gt; but ComfyUI &lt;em&gt;still&lt;/em&gt; throws the error, you have two Python environments and ComfyUI is launching the wrong one — skip to the "Two-environments trap" section below.&lt;/p&gt;
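&lt;p&gt;If you want to script that decision, the version string can be classified by its suffix. This is a small sketch under the assumption that the build tag follows the &lt;code&gt;+&lt;/code&gt; in &lt;code&gt;torch.__version__&lt;/code&gt;, as in the outputs above; &lt;code&gt;classify_torch_build&lt;/code&gt; is a hypothetical helper, not a PyTorch function.&lt;/p&gt;

```python
def classify_torch_build(version):
    """Classify a torch.__version__ string by its local build tag."""
    _, _, tag = version.partition("+")
    if tag == "cpu":
        return "cpu"
    if tag.startswith("cu"):
        return "cuda"
    # No tag, or an unfamiliar one: the string alone is inconclusive,
    # so rely on torch.cuda.is_available() from the one-liner instead.
    return "unknown"


for v in ("2.8.0+cpu", "2.8.0+cu128", "2.8.0"):
    print(v, classify_torch_build(v))
```

&lt;p&gt;Some wheels carry no suffix at all, which is why the sketch returns &lt;code&gt;unknown&lt;/code&gt; rather than guessing; in that case the &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt; line is the part to trust.&lt;/p&gt;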

&lt;p&gt;For a manual (cloned-repo) install, run the same one-liner but activate your venv first, or call the venv's Python directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows venv&lt;/span&gt;
.&lt;span class="se"&gt;\v&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\p&lt;/span&gt;ython.exe &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print(torch.__version__, torch.cuda.is_available())"&lt;/span&gt;

&lt;span class="c"&gt;# Linux/Mac venv&lt;/span&gt;
./venv/bin/python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print(torch.__version__, torch.cuda.is_available())"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Pick the right CUDA wheel for your GPU
&lt;/h2&gt;

&lt;p&gt;This is the part people get wrong. The wheel tag (&lt;code&gt;cu124&lt;/code&gt;, &lt;code&gt;cu126&lt;/code&gt;, &lt;code&gt;cu128&lt;/code&gt;) is the CUDA runtime &lt;em&gt;bundled inside&lt;/em&gt; the PyTorch wheel. It does not need to match a CUDA Toolkit on your machine — the wheel is self-contained. What it &lt;em&gt;does&lt;/em&gt; need to match is your GPU architecture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your GPU&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Wheel to install&lt;/th&gt;
&lt;th&gt;Minimum PyTorch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 50-series (5060 Ti / 5070 / 5080 / 5090)&lt;/td&gt;
&lt;td&gt;Blackwell, &lt;code&gt;sm_120&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cu128&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.7.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 40-series (4060 Ti / 4070 / 4080 / 4090)&lt;/td&gt;
&lt;td&gt;Ada, &lt;code&gt;sm_89&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cu124&lt;/code&gt;, &lt;code&gt;cu126&lt;/code&gt;, or &lt;code&gt;cu128&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;any current&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 30-series (3060 / 3080 / 3090)&lt;/td&gt;
&lt;td&gt;Ampere, &lt;code&gt;sm_86&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cu124&lt;/code&gt;, &lt;code&gt;cu126&lt;/code&gt;, or &lt;code&gt;cu128&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;any current&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The RTX 50-series is the trap. Blackwell's &lt;code&gt;sm_120&lt;/code&gt; compute capability was only added to &lt;strong&gt;stable&lt;/strong&gt; PyTorch in 2.7.0, which shipped the first pre-built CUDA 12.8 wheels with native Blackwell support. If you install an older &lt;code&gt;cu124&lt;/code&gt; wheel on an RTX 5090, you'll get past &lt;em&gt;this&lt;/em&gt; error only to hit &lt;code&gt;CUDA error: no kernel image is available for execution on the device&lt;/code&gt; — the sibling problem of running a too-old wheel on a too-new GPU. On a 50-series card, use &lt;code&gt;cu128&lt;/code&gt; and PyTorch 2.7.0 or newer, full stop.&lt;/p&gt;

&lt;p&gt;For RTX 30/40-series, any of &lt;code&gt;cu124&lt;/code&gt;/&lt;code&gt;cu126&lt;/code&gt;/&lt;code&gt;cu128&lt;/code&gt; works; &lt;code&gt;cu128&lt;/code&gt; is the safe current default since it's what ComfyUI's own portable builds ship now.&lt;/p&gt;
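&lt;p&gt;The table can also be written down as a lookup that produces the pip arguments used in the reinstall steps of this guide. Treat this as a sketch: the dictionary keys and the &lt;code&gt;pip_install_args&lt;/code&gt; name are made up for illustration, and the wheel tags are copied from the table rather than queried from PyTorch.&lt;/p&gt;

```python
PYTORCH_INDEX = "https://download.pytorch.org/whl/"

# Wheel tags per GPU generation, copied from the table in this guide.
WHEEL_TAGS = {
    "rtx50": ["cu128"],
    "rtx40": ["cu124", "cu126", "cu128"],
    "rtx30": ["cu124", "cu126", "cu128"],
}


def pip_install_args(generation, preferred="cu128"):
    """Build pip arguments for the matching CUDA wheel (hypothetical helper)."""
    tags = WHEEL_TAGS[generation]
    # Fall back to the newest listed tag if the preferred one is not supported.
    tag = preferred if preferred in tags else tags[-1]
    return ["-m", "pip", "install", "torch", "torchvision", "torchaudio",
            "--index-url", PYTORCH_INDEX + tag]


print(" ".join(pip_install_args("rtx50", preferred="cu124")))
```

&lt;p&gt;Asking for &lt;code&gt;cu124&lt;/code&gt; on a 50-series card still yields the &lt;code&gt;cu128&lt;/code&gt; index, which mirrors the rule above. Prefix the arguments with the interpreter you verified in Step 1.&lt;/p&gt;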

&lt;h2&gt;
  
  
  Step 3: Reinstall — ComfyUI portable (Windows)
&lt;/h2&gt;

&lt;p&gt;From inside the &lt;code&gt;ComfyUI_windows_portable&lt;/code&gt; directory, uninstall the bad trio first so pip doesn't try to "keep" the CPU build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;.\python_embeded\python.exe &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="kd"&gt;pip&lt;/span&gt; &lt;span class="kd"&gt;uninstall&lt;/span&gt; &lt;span class="na"&gt;-y &lt;/span&gt;&lt;span class="kd"&gt;torch&lt;/span&gt; &lt;span class="kd"&gt;torchvision&lt;/span&gt; &lt;span class="kd"&gt;torchaudio&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the CUDA wheel from PyTorch's index. For an RTX 50-series card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;.\python_embeded\python.exe &lt;span class="na"&gt;-m &lt;/span&gt;&lt;span class="kd"&gt;pip&lt;/span&gt; &lt;span class="kd"&gt;install&lt;/span&gt; &lt;span class="kd"&gt;torch&lt;/span&gt; &lt;span class="kd"&gt;torchvision&lt;/span&gt; &lt;span class="kd"&gt;torchaudio&lt;/span&gt; &lt;span class="na"&gt;--index-url &lt;/span&gt;&lt;span class="kd"&gt;https&lt;/span&gt;://download.pytorch.org/whl/cu128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note &lt;code&gt;--index-url&lt;/code&gt;, not &lt;code&gt;--extra-index-url&lt;/code&gt;. Using &lt;code&gt;--index-url&lt;/code&gt; forces pip to pull &lt;em&gt;only&lt;/em&gt; from the PyTorch index, which guarantees you get the GPU wheel instead of pip silently falling back to the CPU-only one on PyPI. That fallback is the exact mechanism that broke you in the first place.&lt;/p&gt;

&lt;p&gt;Re-run the check from Step 1. You want &lt;code&gt;+cu128&lt;/code&gt; and &lt;code&gt;True&lt;/code&gt;. Then launch ComfyUI and the error is gone.&lt;/p&gt;

&lt;p&gt;If the download is slow or stalls — the CUDA wheels are large, often 2.5 GB-plus because they bundle the CUDA runtime, cuDNN, and NCCL — let it finish; that size is normal and is the whole reason PyPI defaults to the small CPU wheel on Windows in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Reinstall — manual / venv install (Windows, Linux)
&lt;/h2&gt;

&lt;p&gt;If you cloned the repo and run inside a venv, activate it, then do the same uninstall/reinstall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# activate first&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate          &lt;span class="c"&gt;# Linux/Mac&lt;/span&gt;
.&lt;span class="se"&gt;\v&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate           &lt;span class="c"&gt;# Windows&lt;/span&gt;

pip uninstall &lt;span class="nt"&gt;-y&lt;/span&gt; torch torchvision torchaudio
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision torchaudio &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ComfyUI also publishes a maintained requirements path; if you'd rather follow the project's pinned versions, the official install docs list the current recommended &lt;code&gt;cu128&lt;/code&gt; command for your platform. Either way the principle is identical: uninstall CPU &lt;code&gt;torch&lt;/code&gt;, install the &lt;code&gt;cu128&lt;/code&gt; wheel from the PyTorch index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this keeps happening (and how to stop it)
&lt;/h2&gt;

&lt;p&gt;On Windows and macOS, the &lt;code&gt;torch&lt;/code&gt; package on the default Python Package Index (PyPI) is the &lt;strong&gt;CPU-only&lt;/strong&gt; wheel. PyPI serves the lightweight CPU binary by default to those platforms; the CUDA-enabled wheels live only on PyTorch's own download index. So the moment &lt;em&gt;anything&lt;/em&gt; runs a plain &lt;code&gt;pip install torch&lt;/code&gt; — or installs a package that lists &lt;code&gt;torch&lt;/code&gt; as a dependency without pinning the CUDA build — pip happily grabs the CPU wheel from PyPI and overwrites your working GPU install.&lt;/p&gt;

&lt;p&gt;The usual culprit is a &lt;strong&gt;custom node&lt;/strong&gt;. You install some shiny new node, its &lt;code&gt;requirements.txt&lt;/code&gt; says &lt;code&gt;torch&amp;gt;=2.x&lt;/code&gt;, ComfyUI's "install dependencies" step runs, pip decides your current torch doesn't satisfy something, and it reinstalls from PyPI — CPU build. ComfyUI was fine yesterday and broken today, and you "didn't change anything." You did: a node did.&lt;/p&gt;

&lt;p&gt;Two habits prevent the relapse:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;When installing custom-node requirements, never let pip touch torch.&lt;/strong&gt; If a node's requirements pull torch, install the node&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>comfyui</category>
      <category>cuda</category>
      <category>pytorch</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Using the channels-last memory format reduced the latency of our convolutional backbone by 22%</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Wed, 24 Jun 2026 05:36:21 +0000</pubDate>
      <link>https://dev.to/elise_moreau/channels-last-memory-format-cut-our-conv-backbone-latency-22-19l2</link>
      <guid>https://dev.to/elise_moreau/channels-last-memory-format-cut-our-conv-backbone-latency-22-19l2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cut inference latency by about 22% on A100s, with no accuracy change and a four-line code edit.&lt;/p&gt;

&lt;p&gt;Our background-removal model at Photoroom spent roughly 31 ms per 1024x1024 image on an A100, and profiling pointed most of that time at cuDNN convolution kernels rather than our diffusion sampler. The model is a fairly standard U-Net style encoder-decoder, all convolutions, running in float16 under &lt;code&gt;torch.autocast&lt;/code&gt;. Before touching the architecture, I wanted to rule out the cheap wins, and the cheapest one turned out to be tensor memory layout. The channels-last memory format gave us most of the speedup we were chasing, and the change fit in a handful of lines. To be precise, the network math is identical; only the memory layout of the activations changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What channels-last memory format changes
&lt;/h2&gt;

&lt;p&gt;The channels-last memory format stores a 4D activation tensor in NHWC order, keeping the channel values for one spatial position contiguous in memory. PyTorch keeps the logical NCHW shape, so your indexing and your model code stay the same. What changes is the stride pattern, which lets cuDNN select kernels that read contiguous channels and run more efficiently on tensor-core hardware.&lt;/p&gt;

&lt;p&gt;The default PyTorch layout is NCHW (channels-first), where all of one channel's pixels sit together. NVIDIA's tensor cores prefer the NHWC arrangement for convolutions, as documented in their &lt;a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html" rel="noopener noreferrer"&gt;convolution performance guide&lt;/a&gt;. When your tensors arrive in NCHW, cuDNN often inserts transpose passes around each convolution to reshuffle data, and those transposes are pure overhead. Converting once at the input and keeping the format consistent removes that per-layer reshuffling.&lt;/p&gt;
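&lt;p&gt;The stride change is easy to work out by hand. The sketch below computes element strides for both layouts in plain Python; the two function names are illustrative, and the values can be compared against what &lt;code&gt;x.stride()&lt;/code&gt; reports for a real tensor.&lt;/p&gt;

```python
def contiguous_strides(n, c, h, w):
    """Element strides for the default NCHW (channels-first) layout."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides for channels-last: same NCHW shape, NHWC order in memory."""
    return (h * w * c, 1, w * c, c)


shape = (2, 3, 4, 5)  # N, C, H, W
print(contiguous_strides(*shape))     # (60, 20, 5, 1)
print(channels_last_strides(*shape))  # (60, 1, 15, 3)
```

&lt;p&gt;The channel stride of 1 in the second result is the point: the values for all channels at one spatial position sit next to each other, while the shape the model code sees is unchanged.&lt;/p&gt;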

&lt;h2&gt;
  
  
  Converting a PyTorch model to channels-last
&lt;/h2&gt;

&lt;p&gt;The conversion API has been stable since well before PyTorch 2.3, and the &lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;official memory format tutorial&lt;/a&gt; covers the details. Two things need the format: the module parameters and the input tensor. If only one of them is channels-last, cuDNN falls back to NCHW kernels and you gain nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# convert the model's conv weights once, at load time
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# convert each input batch to match
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels_last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autocast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# output is channels_last; convert back if a
&lt;/span&gt;                  &lt;span class="c1"&gt;# downstream op needs contiguous NCHW
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One subtlety worth checking: &lt;code&gt;torch.channels_last&lt;/code&gt; is only defined for 4D tensors, and &lt;code&gt;x.to(memory_format=torch.channels_last)&lt;/code&gt; raises a &lt;code&gt;RuntimeError&lt;/code&gt; on a 3D tensor, so make sure your inputs carry an explicit batch dimension. After the forward pass, the output keeps channels-last strides. If you feed it into an operation that assumes contiguous NCHW, call &lt;code&gt;.contiguous()&lt;/code&gt; there rather than reverting the whole pipeline.&lt;/p&gt;
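&lt;p&gt;One way to confirm the conversion took effect is to inspect strides. The sketch below does this on small CPU tensors; the shapes and the helper name &lt;code&gt;is_channels_last&lt;/code&gt; are assumptions for illustration, not code from our pipeline.&lt;/p&gt;

```python
import torch


def is_channels_last(t):
    """True if a 4D tensor is laid out as NHWC in memory."""
    return t.dim() == 4 and t.is_contiguous(memory_format=torch.channels_last)


x = torch.randn(2, 3, 8, 8)                     # default contiguous (NCHW) layout
x_cl = x.to(memory_format=torch.channels_last)  # same shape and values, new strides

print(x.stride(), is_channels_last(x))
print(x_cl.stride(), is_channels_last(x_cl))

# conv weights are 4D as well, so converting the module changes their strides too
conv = torch.nn.Conv2d(3, 4, kernel_size=3).to(memory_format=torch.channels_last)
print(is_channels_last(conv.weight))

# .contiguous() defaults to the NCHW layout, for downstream ops that require it
back = x_cl.contiguous()
print(back.stride(), torch.equal(back, x))
```

&lt;p&gt;A check like this is cheap enough to keep as an assertion at the model boundary while you are still validating the change.&lt;/p&gt;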

&lt;h2&gt;
  
  
  Why NHWC is faster on tensor cores
&lt;/h2&gt;

&lt;p&gt;Tensor cores execute matrix-multiply-accumulate on small tiles, and convolutions get lowered to those tile operations. With NHWC layout the channel dimension, which is the contracting dimension of the implicit matmul, is contiguous, so the kernel loads aligned vectors without gathering strided data. The effect grows with channel count. Our deepest encoder blocks at 512 channels saw the largest per-layer improvement, while the early high-resolution layers at 64 channels barely moved.&lt;/p&gt;

&lt;p&gt;The gain also depends on precision. Channels-last pairs best with float16 or bfloat16, because that is where tensor cores are reliably engaged; in pure float32 the kernels often route through CUDA cores (unless TF32 is enabled on Ampere or newer), where the layout advantage shrinks. We were already running float16 under autocast, so the two optimizations stacked. Channels-last is not a free win in every configuration: it pays off when your convolutions are wide, your precision is reduced, and your hardware has tensor cores.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the speedup without fooling yourself
&lt;/h2&gt;

&lt;p&gt;A layout change is easy to misattribute, so I measured carefully. I ran 200 warmup iterations, then timed 1000 forward passes with &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; around each measurement window, since CUDA calls are asynchronous and an unsynchronized timer reports queue time rather than kernel time. I also confirmed the output tensors matched the NCHW baseline within float16 tolerance, so I knew I was timing the same computation.&lt;/p&gt;
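&lt;p&gt;The measurement procedure can be sketched as a small harness. This is an illustration rather than our actual benchmark script: the helper name &lt;code&gt;benchmark&lt;/code&gt; and its &lt;code&gt;sync&lt;/code&gt; parameter are assumptions, and the demo times a CPU function so the block runs anywhere.&lt;/p&gt;

```python
import time


def benchmark(fn, warmup=200, iters=1000, sync=None):
    """Mean seconds per call of fn, measured over one synchronized window."""
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()                     # warm caches and let autotuning settle
    sync()                       # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                       # wait for the last call before stopping the clock
    return (time.perf_counter() - start) / iters


# CPU stand-in for a forward pass, just to exercise the harness
mean_s = benchmark(lambda: sum(range(1000)), warmup=10, iters=100)
print(f"{mean_s * 1e6:.1f} microseconds per call")
```

&lt;p&gt;On a GPU the call would look like &lt;code&gt;benchmark(lambda: model(x), sync=torch.cuda.synchronize)&lt;/code&gt;, assuming &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; already live on the device. Without the two synchronization points the timer would mostly measure how fast kernels are queued.&lt;/p&gt;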

&lt;p&gt;The headline number was a drop from roughly 31 ms to 24 ms per image, about 22% on our A100. On a V100 the same change gave closer to 14%, which tracks with its older tensor-core generation. I would treat any single-number claim with suspicion until you reproduce it on your own shapes; the benefit is real but hardware-dependent and model-dependent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The format is not universally beneficial. Networks dominated by pointwise operations, normalization, or attention rather than spatial convolutions show little or no improvement, because those ops do not hit the cuDNN convolution path that NHWC accelerates. Transformer backbones, for instance, rarely care.&lt;/p&gt;

&lt;p&gt;There is also a correctness trap. Mixing layouts inside a model can silently insert transposes that erase the gain, and some custom operators or older third-party layers assume contiguous NCHW and will either copy or error. If you run &lt;code&gt;torch.compile&lt;/code&gt;, verify the format survives the traced graph rather than assuming it does. For very small channel counts the conversion overhead can outweigh the kernel savings, so profile before committing it everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;The channels-last memory format is one of the few optimizations that costs almost nothing to try and is straightforward to revert if it does not help. For a convolution-heavy vision model running in float16 on tensor-core GPUs, it is worth measuring before you reach for quantization or architectural surgery. What I would try next is combining it with &lt;code&gt;torch.compile&lt;/code&gt; and a CUDA graph capture, then re-profiling to see how much transpose overhead is actually left in the trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;PyTorch channels-last memory format tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html" rel="noopener noreferrer"&gt;NVIDIA convolution performance and NHWC layout guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/amp.html" rel="noopener noreferrer"&gt;PyTorch autocast and mixed precision docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/cudnn/latest/index.html" rel="noopener noreferrer"&gt;cuDNN developer guide on tensor layouts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pytorch</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Data Science Workload: RAM Limits on the Dell Pro Max 14 MC14250</title>
      <dc:creator>Review Laptop</dc:creator>
      <pubDate>Tue, 23 Jun 2026 07:58:01 +0000</pubDate>
      <link>https://dev.to/hung_phatlaptop_a651fc86/data-science-workload-gioi-han-ram-tren-dell-pro-max-14-mc14250-phe</link>
      <guid>https://dev.to/hung_phatlaptop_a651fc86/data-science-workload-gioi-han-ram-tren-dell-pro-max-14-mc14250-phe</guid>
      <description>&lt;p&gt;In data science, managing system resources is a constant balancing act. When you run &lt;strong&gt;Jupyter Notebook&lt;/strong&gt;, process data with &lt;strong&gt;pandas&lt;/strong&gt;, and train a model with &lt;strong&gt;PyTorch&lt;/strong&gt; all at once, the line between "smooth" and "Out of Memory (OOM)" becomes very thin.&lt;/p&gt;

&lt;p&gt;To check this in practice, I tested on a &lt;a href="https://www.reviewlaptop.vn/dell-pro-max-14-mc14250-workstation-14-inch/" rel="noopener noreferrer"&gt;Dell Pro Max 14 MC14250&lt;/a&gt; configured with a Core Ultra 7 255H and 16GB of LPCAMM2 LPDDR5x RAM. The goal was to find the memory ceiling under heavy workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing large datasets with pandas in practice
&lt;/h2&gt;

&lt;p&gt;When you load a CSV file of about 2-3GB, &lt;code&gt;pandas&lt;/code&gt; typically takes up 3-5 times the size of the original file because of how it stores data types in memory. With 16GB of RAM, unless you optimize by reading with &lt;code&gt;chunksize&lt;/code&gt; or shrinking the data types (&lt;code&gt;downcast&lt;/code&gt;), the system quickly hits its limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Example: load a large dataset and check its memory footprint
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;large_dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# With a 3GB source file, RAM usage can jump past 10GB right away.
&lt;/span&gt;&lt;span class="c1"&gt;# info() prints its report itself; memory_usage="deep" also counts string contents.
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_usage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
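&lt;p&gt;A minimal sketch of the two optimizations named above, reading in chunks and downcasting numeric columns. The helper names &lt;code&gt;downcast_numeric&lt;/code&gt; and &lt;code&gt;load_in_chunks&lt;/code&gt; and the tiny in-memory CSV are assumptions for illustration; with a real file you would pass its path instead of the &lt;code&gt;StringIO&lt;/code&gt; object.&lt;/p&gt;

```python
import io

import pandas as pd


def downcast_numeric(df):
    """Copy of df with numeric columns shrunk to the smallest dtype that fits."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            out[name] = pd.to_numeric(col, downcast="float")
    return out


def load_in_chunks(source, chunksize=100_000):
    """Read a CSV piece by piece, shrinking each piece before keeping it."""
    chunks = [downcast_numeric(c) for c in pd.read_csv(source, chunksize=chunksize)]
    return pd.concat(chunks, ignore_index=True)


# small synthetic CSV standing in for a multi-gigabyte file
csv_text = "id,price\n" + "\n".join(f"{i},{i * 0.5}" for i in range(1000))
plain = pd.read_csv(io.StringIO(csv_text))
small = load_in_chunks(io.StringIO(csv_text), chunksize=250)

print("default dtypes:", plain.memory_usage(deep=True).sum(), "bytes")
print("downcast dtypes:", small.memory_usage(deep=True).sum(), "bytes")
```

&lt;p&gt;Chunking lowers the peak during parsing, but the concatenated result still has to fit in RAM. When it does not, aggregate each chunk and keep only the aggregates.&lt;/p&gt;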



&lt;h2&gt;
  
  
  PyTorch batch size and swap behavior on the iGPU
&lt;/h2&gt;

&lt;p&gt;When you move on to training a model with PyTorch on the integrated GPU (Intel Arc Pro 140T), its memory is shared with system RAM. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch Size:&lt;/strong&gt; With mid-sized models, too large a batch size quickly triggers &lt;code&gt;RuntimeError: CUDA out of memory&lt;/code&gt; (or the equivalent error on Intel GPUs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap Behavior:&lt;/strong&gt; Once usage exceeds 16GB, Windows starts using the pagefile (swap). At that point processing speed drops drastically, because SSD access is far slower than LPDDR5x RAM. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical conclusion:&lt;/strong&gt; With the standard configuration of the Dell Pro Max 14 MC14250, you can comfortably handle datasets under 1GB. For larger data, however, upgrading to the maximum of &lt;strong&gt;64GB RAM&lt;/strong&gt; (made possible by LPCAMM2 support) is a necessary step to avoid bottlenecks when running a professional data science workflow.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is a technical summary. See the original post for the full review.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dellpromax14</category>
      <category>datascience</category>
      <category>pytorch</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>The SDXL VAE overflow that decoded black images in fp16</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Tue, 23 Jun 2026 05:37:00 +0000</pubDate>
      <link>https://dev.to/elise_moreau/the-sdxl-vae-overflow-that-decoded-black-images-in-fp16-46g6</link>
      <guid>https://dev.to/elise_moreau/the-sdxl-vae-overflow-that-decoded-black-images-in-fp16-46g6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: The SDXL VAE decoder pushes activations past 65504, the max value fp16 can hold, so the last decode step overflows to inf and you get a fully black image. At Photoroom we hit this on roughly 1 in 600 product renders before we caught it. The fix is to upcast only the VAE, or swap in rescaled decoder weights, not to drop the whole pipeline to fp32.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run SDXL-based pipelines for product photography. A customer uploads a sneaker on a kitchen table, we cut it out, then generate a clean studio background around it. Hundreds of thousands of renders a day, mostly on A10G and A100 GPUs, with the UNet in fp16 to keep the per-image latency under our budget.&lt;/p&gt;

&lt;p&gt;The bug showed up as a thin stream of complaints. Black image. No error, no stack trace, no NaN warning in the logs. Just a 1024x1024 PNG of pure black where a render should be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening
&lt;/h2&gt;

&lt;p&gt;I pulled 40 of the failing seeds and replayed them with hooks on every module in the VAE decoder. The UNet output was fine. Latents looked normal, values in the usual range. The decode was where it died.&lt;/p&gt;

&lt;p&gt;To be precise, the overflow lives in the decoder's mid and up blocks. SDXL's VAE has a few residual layers where the post-convolution activations spike hard for certain inputs. fp16 tops out at 65504. I logged a max activation of 3.1e5 inside one of the &lt;code&gt;up_blocks&lt;/code&gt; resblocks on a failing seed. Once a single value hits inf, the following GroupNorm propagates it across the whole feature map, and you decode garbage that clamps to black.&lt;/p&gt;

&lt;p&gt;The nuance here is that it's input-dependent. Most latents never come close to the ceiling. High-contrast scenes with bright speculars, like a glossy bottle on white, are the ones that tip over. That's why our QA never saw it and production did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# hook to catch the overflow as it happens
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;6e4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# fp16 max is 65504
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: max activation &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;named_modules&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_forward_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That printout is what pointed me at the exact resblock instead of guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The options we weighed
&lt;/h2&gt;

&lt;p&gt;There's no single right answer here, and the trade-off is VRAM and latency against correctness. We measured four approaches on the same 500-seed batch.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Fixes overflow&lt;/th&gt;
&lt;th&gt;VAE decode latency&lt;/th&gt;
&lt;th&gt;Extra VRAM&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full pipeline fp32&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+210%&lt;/td&gt;
&lt;td&gt;~2x&lt;/td&gt;
&lt;td&gt;Kills our latency budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;force_upcast&lt;/code&gt; VAE to fp32&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+18%&lt;/td&gt;
&lt;td&gt;+1.1 GB&lt;/td&gt;
&lt;td&gt;Only the VAE runs fp32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VAE in bf16&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+6%&lt;/td&gt;
&lt;td&gt;+0.1 GB&lt;/td&gt;
&lt;td&gt;Needs Ampere or newer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp16-fix decoder weights&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;+0%&lt;/td&gt;
&lt;td&gt;+0 GB&lt;/td&gt;
&lt;td&gt;Rescaled weights, fp16 stays&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full fp32 was off the table. It doubled memory and blew past the latency we promise. The other three all hold up.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;force_upcast&lt;/code&gt; is the diffusers default for a reason. It keeps the UNet in fp16 and runs only the VAE in fp32. One flag, and the overflow is gone because fp32 has the headroom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionXLPipeline&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionXLPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stabilityai/stable-diffusion-xl-base-1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vae&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;force_upcast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# VAE runs fp32, UNet stays fp16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We landed on bf16 for the VAE on our Ampere fleet. bf16 has the same exponent range as fp32, so the 3.1e5 activation fits without issue, and the decode cost was 6% instead of 18%. On the older A10G boxes that don't get us the bf16 path we wanted, we use the rescaled fp16-fix decoder weights, which shift the activation magnitudes down so they never reach the ceiling in the first place.&lt;/p&gt;
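&lt;p&gt;The range argument can be checked from the bit layouts alone. The sketch below derives the largest finite value and the spacing of values near 1.0 from each format's exponent and mantissa widths. The helper names are illustrative; the field widths are the standard ones, with bf16 storing 7 mantissa bits (8 counting the implicit leading bit).&lt;/p&gt;

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with these field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # the all-ones exponent is reserved for inf and NaN, hence the "- 2"
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


# (exponent bits, stored mantissa bits)
FORMATS = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}
peak = 3.1e5  # the activation logged on a failing seed

for name, (exp_bits, man_bits) in FORMATS.items():
    limit = max_finite(exp_bits, man_bits)
    print(f"{name}: max {limit:.5g}, epsilon {epsilon(man_bits):.2g}, "
          f"holds {peak:g}: {limit >= peak}")
```

&lt;p&gt;For fp16 the formula gives exactly 65504, which is why the logged activation cannot be represented there, while the 8-bit exponent of bf16 and fp32 leaves ample headroom.&lt;/p&gt;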

&lt;p&gt;One detail that bit us: if you call &lt;code&gt;pipe.enable_vae_tiling()&lt;/code&gt; for large outputs, the tiling runs before the dtype upcast, so you still need the dtype right. Tiling reduces peak memory, it does not touch the numerical range.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the gateway fits
&lt;/h2&gt;

&lt;p&gt;A side note, since people ask how the text side of this connects. Before the diffusion step, we rewrite the user's scene description into a cleaner prompt with an LLM, and we generate alt-text captions after. Those LLM calls go through Bifrost, an open-source gateway that gives us one OpenAI-compatible endpoint with automatic failover across providers. It has nothing to do with the VAE overflow. It just means when one provider has a bad afternoon, the caption step doesn't take the render pipeline down with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;bf16 is not a free win. It has the range of fp32 but only 7 explicit mantissa bits (8 bits of precision counting the implicit leading bit) against fp16's 10 explicit bits, so you trade overflow safety for a little precision. On our renders there was no visible difference, but I would not assume that for every model. Measure SSIM against an fp32 reference before you ship.&lt;/p&gt;

&lt;p&gt;The fp16-fix weights are a community rescaling, not an official release. They work well, and we validated them on 2000 renders, but you're trusting a third-party checkpoint. Pin the exact revision.&lt;/p&gt;

&lt;p&gt;And none of this helps if your latents themselves are out of distribution. We saw two black images that were not VAE overflow at all, they were a bad LoRA producing extreme latents. The hook above tells you which failure you're looking at, so put it in your eval harness, not only in debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/en/api/models/autoencoderkl" rel="noopener noreferrer"&gt;diffusers VAE and &lt;code&gt;force_upcast&lt;/code&gt; docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/madebyollin/sdxl-vae-fp16-fix" rel="noopener noreferrer"&gt;sdxl-vae-fp16-fix rescaled weights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus" rel="noopener noreferrer"&gt;bfloat16 numerics, the original Google brief&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.01952" rel="noopener noreferrer"&gt;SDXL paper, architecture details&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost gateway&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pytorch</category>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Intro to Computer Vision Code-Along series - S1E0</title>
      <dc:creator>Levente Slajcho</dc:creator>
      <pubDate>Mon, 22 Jun 2026 11:34:38 +0000</pubDate>
      <link>https://dev.to/levente-slajcho/cva-computer-vision-adventure-series-s1e0-intro-3pjn</link>
      <guid>https://dev.to/levente-slajcho/cva-computer-vision-adventure-series-s1e0-intro-3pjn</guid>
      <description>&lt;h1&gt;
  
  
  Motivation
&lt;/h1&gt;

&lt;p&gt;Let me start with a very short story.&lt;/p&gt;

&lt;p&gt;I did my first project involving Computer Vision when I was 15 years old, completely fascinated by technology and by creative solutions to all kinds of problems.&lt;/p&gt;

&lt;p&gt;At the time, I thought it would be cool to turn my PC into a touchscreen device, so I took the naked LCD panel and diffuser layer from an old screen and built them into a cardboard box. I also disassembled my webcam and replaced its RGB filter with a makeshift infrared filter made from the black disk of an old floppy disk. The sketchy IR camera, together with a few IR LEDs, was placed inside the box, and whenever I touched the diffuser, it reflected the IR light back to the camera sensor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbaqhx8gqujo5o2fucmwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbaqhx8gqujo5o2fucmwg.png" width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br&gt;I had something like this in mind. Ended up with an 8 inch screen of an old multimedia station for cars | Image source: &lt;a href="https://prototypinginterfaces.com/5-5/" rel="noopener noreferrer"&gt;https://prototypinginterfaces.com/5-5/&lt;/a&gt;
  &lt;p&gt;&lt;/p&gt;

&lt;p&gt;Using CCV from NUi Group, a company that has since vanished (&lt;a href="https://github.com/nuigroup/ccv2" rel="noopener noreferrer"&gt;https://github.com/nuigroup/ccv2&lt;/a&gt;), I calibrated the four corners of the screen, and together with the TUIO mouse driver, that was enough to track my fingers and use them as multi-touch input.&lt;/p&gt;

&lt;p&gt;I can't really describe what it feels like as a teenager to build a touchscreen PC for exactly $0. That small project opened a huge window for me. It showed me that cameras are not only for recording fun and memorable moments - they can also be used to build things, solve problems, and interact with the world in completely different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast forward to 2026
&lt;/h3&gt;

&lt;p&gt;A little over a decade later, I graduated in Media Informatics and Visual Computing, and I now have almost 8 years of combined professional experience in 3D design, product development, and Java development. The first satisfied my love for DIY projects, the latter my love for IT.&lt;/p&gt;

&lt;p&gt;In a way, Computer Vision as my ultimate career goal feels like the combination of those two worlds. Cameras and image processing have a very strong connection to the real world, especially if you consider Computer Vision as part of robotics - and that is exactly the field I am absolutely in love with.&lt;/p&gt;

&lt;p&gt;However, having experience only from my studies is a turn-off for companies looking to hire a Computer Vision Engineer. &lt;/p&gt;

&lt;p&gt;Nine years ago I managed to get a 3D designer job working with SolidWorks after just two weeks of practicing day and night, turning my hobby and personal interest into a profession. Computer Vision is of course a more complex topic, but I am convinced that with the same amount of motivation and enthusiasm the same thing will happen again.&lt;/p&gt;




&lt;h1&gt;
  
  
  About the Computer Vision Code-Along series
&lt;/h1&gt;

&lt;p&gt;Let's climb this mountain together, and follow me if you're interested.&lt;/p&gt;

&lt;p&gt;If you are in a similar situation and want to work in this field, helping the world with your own vision and your computer's vision, stick with me. In this series, I'll be working on three kinds of projects: Kaggle competitions, real-life problems, and totally made-up problems that nobody ever asked for a solution to - let’s call those fun projects.&lt;/p&gt;

&lt;p&gt;The focus of every project is to learn something new, gain experience, and overcome problems, whether they come down to a skill issue or are technical in nature.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to expect and what not to expect
&lt;/h3&gt;

&lt;p&gt;This series is about modern Computer Vision, using neural networks in the first season and vision transformers (ViT) in the second. Some basic but solid knowledge of traditional Computer Vision methods is required to keep up.&lt;/p&gt;

&lt;p&gt;It is not a shortcut to expertise in modern Computer Vision. Expect a rather slow pace, and don't expect to find the best possible solutions here. That is exactly the point of this series: you're learning with me, but more importantly for yourself. Think, code, debug, experiment, and let others know in the comments if you came up with a different solution.&lt;/p&gt;

&lt;p&gt;Over the next few months, at roughly 1-2 episodes a week, we'll go through different Computer Vision techniques and work on projects related to them in a learning-by-doing manner. If your learning style is very theory-first, then this series might not be the perfect fit for you - although I still recommend following along, because we'll talk about theory as well.&lt;/p&gt;

&lt;p&gt;You'll also get full transparency into my technical struggles. At first glance, some parts may feel redundant, but these insights are part of the journey. This is not a course; it is a series of blog posts aimed at exploring, learning, trying different paths, and gaining experience in this field. &lt;/p&gt;

&lt;p&gt;If you stay with me until the end, you'll hopefully become the proud owner of a beautiful GitHub repo and gain insight and experience in modern Computer Vision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to start
&lt;/h3&gt;

&lt;p&gt;Depending on your learning style and your starting point, there are different ways to begin, but most importantly, absolutely get familiar with OpenCV.&lt;/p&gt;

&lt;p&gt;If you are completely new to Computer Vision, I strongly recommend building solid foundations in traditional Computer Vision first.&lt;br&gt;
For complete beginners, I also made a small Jupyter Notebook as an appetizer that showcases OpenCV filters using nothing but your webcam. You can find it here: &lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/slelo" rel="noopener noreferrer"&gt;
        slelo
      &lt;/a&gt; / &lt;a href="https://github.com/slelo/CVCA-S1E0-Mini-OpenCV-Playground" rel="noopener noreferrer"&gt;
        CVCA-S1E0-Mini-OpenCV-Playground
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Computer Vision Appetizer for complete beginners
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;About&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;This repository is part of the first episode of my newly started Computer Vision Code-Along blog post series. &lt;br&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://dev.to/levente-slajcho/cva-computer-vision-adventure-series-s1e0-intro-3pjn" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/0b90b95355fe854da0231021755cba147476c7512c3bc525442359ae1afcee43/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f526561642532306d6f72652532306f6e2d6465762e746f2d3041304130413f7374796c653d666f722d7468652d6261646765266c6f676f3d646576646f74746f266c6f676f436f6c6f723d7768697465" alt="Read more on dev.to"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;OpenCV-Filters&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;Computer Vision Appetizer for beginners: Simple code with OpenCV filters.
Feel free to explore, experiment, change parameters, and learn by doing.&lt;/p&gt;
&lt;p&gt;Disclaimer: this repo will be updated from time to time&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Requirements&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Python 3.10 or higher&lt;/li&gt;
&lt;li&gt;OpenCV&lt;/li&gt;
&lt;li&gt;Jupyter Notebook&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;This one isn't gonna be too long, just run:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install -r requirements.txt&lt;/pre&gt;

&lt;/div&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/slelo/CVCA-S1E0-Mini-OpenCV-Playground" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;





&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4f6dwa4vihqqt4hw2xci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4f6dwa4vihqqt4hw2xci.png" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Example with Canny filter&lt;/p&gt;

&lt;p&gt;Once you're comfortable with these basics, I wholeheartedly recommend (and more or less require) completing the Deep Learning Specialization by Andrew Ng on Coursera (&lt;a href="https://www.coursera.org/specializations/deep-learning" rel="noopener noreferrer"&gt;https://www.coursera.org/specializations/deep-learning&lt;/a&gt;). It explains what is happening under the hood, and the assignments make you implement many of those ideas yourself.&lt;/p&gt;

&lt;p&gt;I'll be using PyCharm as a development environment and Python 3.10 and 3.11 by default for compatibility reasons. If we use other tools in later projects, I'll let you know.&lt;/p&gt;




&lt;h1&gt;
  
  
  Foreshadowing
&lt;/h1&gt;

&lt;p&gt;In the next episode, we'll use U-Nets for image segmentation on data from an inactive Kaggle competition. Until then, you can read more about them here: &lt;a href="https://towardsdatascience.com/understanding-u-net-61276b10f360/" rel="noopener noreferrer"&gt;https://towardsdatascience.com/understanding-u-net-61276b10f360/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please make sure you have a basic understanding of Convolutional Neural Networks. To build better intuition, I also recommend reading about AlexNet, ResNet, and MobileNet, and learning how they work and why they became so popular. This video and the ones that follow it in the playlist will help: &lt;a href="https://www.youtube.com/watch?v=-bvTzZCEOdM&amp;amp;list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&amp;amp;index=12" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=-bvTzZCEOdM&amp;amp;list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&amp;amp;index=12&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next episode will be linked here when it's ready.&lt;/p&gt;

&lt;p&gt;Thank you for reading, and your thoughts are more than welcome in the comments.&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>tensorflow</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>The seam our tiled upscaler left on every 4K product render</title>
      <dc:creator>Elise Moreau</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:51:10 +0000</pubDate>
      <link>https://dev.to/elise_moreau/the-seam-our-tiled-upscaler-left-on-every-4k-product-render-pf5</link>
      <guid>https://dev.to/elise_moreau/the-seam-our-tiled-upscaler-left-on-every-4k-product-render-pf5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We tile high-res images through our upscaler because a full 4096×4096 pass blows past 24GB of VRAM. For months every render had a faint cross down the middle. The fix was not a bigger GPU. It was admitting that hard tile boundaries break any model with a receptive field, and feathering the overlap with a raised-cosine weight instead of averaging it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At Photoroom I work on the generative side, mostly diffusion for product photography. One of our smaller models is a convolutional upscaler that takes a 1024px cutout and pushes it to print resolution. Nothing exotic. A residual-in-residual dense block network, the kind of thing that has been around since ESRGAN in 2018.&lt;/p&gt;

&lt;p&gt;It worked fine in the notebook. In production, on large images, it left a seam.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a seam actually is
&lt;/h2&gt;

&lt;p&gt;You cannot run a 4096×4096 image through this model on a single 24GB card. So you tile. Cut the image into 512px squares, upscale each, stitch them back. The naive version of this is three lines of code and it is wrong.&lt;/p&gt;

&lt;p&gt;The reason is the receptive field. To be precise, every output pixel near a tile edge was computed from a partial neighborhood. The convolutions on the right edge of the left tile never saw the pixels that lived in the right tile. So the two halves disagreed by a small amount, maybe 2-3 grey levels, and the human eye is very good at finding a straight vertical line of consistent 2-3 level error. On a flat grey studio background it was obvious. On busy texture it hid.&lt;/p&gt;

&lt;p&gt;We measured it. Sampling 200 renders, the mean absolute difference across the stitch line was 4.1 on an 8-bit scale, versus 0.9 for an adjacent non-seam column. Small number, very visible artifact.&lt;/p&gt;
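&lt;p&gt;A minimal sketch of how a number like this could be computed, assuming a greyscale render held as a NumPy array and a known stitch column. The helper name &lt;code&gt;column_step_mad&lt;/code&gt; and the synthetic image are illustrative assumptions, not the measurement code behind the figures above.&lt;/p&gt;

```python
import numpy as np

def column_step_mad(img: np.ndarray, x: int):
    """Mean absolute difference between column x-1 and column x (8-bit scale)."""
    left = img[:, x - 1].astype(np.float64)
    right = img[:, x].astype(np.float64)
    return float(np.abs(right - left).mean())

# Synthetic flat grey background whose right half is 4 levels brighter,
# standing in for two tiles that disagree at the stitch line x = 512.
img = np.full((256, 1024), 128, dtype=np.uint8)
img[:, 512:] += 4

seam = column_step_mad(img, 512)      # across the stitch line
control = column_step_mad(img, 513)   # adjacent non-seam column
print(seam, control)
```

&lt;p&gt;On real renders the control column is never exactly zero, which is why the comparison against a neighbouring column matters more than the absolute value.&lt;/p&gt;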

&lt;h2&gt;
  
  
  Overlap is necessary but not sufficient
&lt;/h2&gt;

&lt;p&gt;The first fix everyone reaches for is overlapping tiles. Take 512px tiles but step by 448, so each pair shares a 64px strip. Then in the shared region you have two predictions and you blend them.&lt;/p&gt;
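&lt;p&gt;The stepping can be sketched as below. The helper &lt;code&gt;tile_starts&lt;/code&gt; is a hypothetical name, and pinning the last tile to the far edge is one assumed way of handling sizes that do not divide evenly.&lt;/p&gt;

```python
def tile_starts(length: int, tile: int = 512, stride: int = 448):
    """Start offsets along one axis so that tiles of size `tile` cover `length`."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile is pinned to the far edge
    return starts

starts = tile_starts(4096)
print(len(starts), starts[:3], starts[-1])  # prints: 9 [0, 448, 896] 3584
```

&lt;p&gt;For a 4096px axis this gives nine starts, and every neighbouring pair of tiles shares exactly 512 - 448 = 64 pixels. For other sizes the final pair simply overlaps by more.&lt;/p&gt;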

&lt;p&gt;The nuance here is how you blend. If you average the overlap with a flat 0.5/0.5 weight, you have moved the discontinuity, not removed it. The blend region now has a soft step at each of its two edges where the weighting suddenly kicks in. Better than before. Still a seam, just blurrier.&lt;/p&gt;
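&lt;p&gt;A one-dimensional toy makes the difference concrete. It assumes two tiles that predict constant values 4 grey levels apart and compares the largest pixel-to-pixel jump under a flat average and under a raised-cosine cross-fade. All numbers are illustrative.&lt;/p&gt;

```python
import math

OVERLAP = 64
left_val, right_val = 100.0, 104.0   # two tiles that disagree by 4 grey levels

def blended_row(weights_right):
    """Row of pixels: left-only region, overlap cross-fade, right-only region."""
    overlap = [(1 - w) * left_val + w * right_val for w in weights_right]
    return [left_val] * 32 + overlap + [right_val] * 32

def max_step(row):
    """Largest jump between neighbouring pixels."""
    return max(abs(b - a) for a, b in zip(row, row[1:]))

flat = [0.5] * OVERLAP
# Raised-cosine ramp from just above 0 to just below 1 across the overlap.
hann = [0.5 * (1 - math.cos(math.pi * (i + 1) / (OVERLAP + 1))) for i in range(OVERLAP)]

print(max_step(blended_row(flat)))   # two hard steps of 2.0 each
print(max_step(blended_row(hann)))   # a small fraction of a grey level
```

&lt;p&gt;The flat average halves the step but keeps it as a hard edge, now in two places. The cosine ramp spreads the same 4-level disagreement over the whole strip.&lt;/p&gt;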

&lt;p&gt;What works is a weight that goes smoothly to zero at the tile border, so a pixel contributes nothing exactly where its receptive field ran out. A raised-cosine (Hann) window does this. Each tile is multiplied by its window, the windows are accumulated, and you divide by the summed weight.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hann_2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ramp up over the overlap, flat in the middle, ramp down
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
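    &lt;span class="c1"&gt;# skip the window's zero endpoint so image-border pixels keep a nonzero weight
&lt;/span&gt;    &lt;span class="n"&gt;ramp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hann_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;periodic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;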
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ramp&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ramp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;   &lt;span class="c1"&gt;# outer product -&amp;gt; 2D
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;blend_tile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;
    &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;win&lt;/span&gt;
    &lt;span class="c1"&gt;# caller does canvas / weight.clamp_min(1e-8) at the end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
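&lt;p&gt;For completeness, here is a sketch of the caller side: the loop that walks the tiles, accumulates, and normalises. It is restated in NumPy so it runs without a GPU, uses an identity function in place of the upscaler, and keeps the scale factor at 1. The names &lt;code&gt;taper_1d&lt;/code&gt; and &lt;code&gt;tiled_apply&lt;/code&gt; are hypothetical. A real upscaler would also multiply the output coordinates by its scale factor.&lt;/p&gt;

```python
import numpy as np

def taper_1d(size: int, overlap: int):
    """Flat in the middle, raised-cosine ramps over `overlap` pixels at each end."""
    w = np.ones(size)
    # np.hanning is zero at both ends; skip that zero so every pixel keeps some weight.
    ramp = np.hanning(2 * overlap + 2)[1:overlap + 1]
    w[:overlap] = ramp
    w[-overlap:] = ramp[::-1]
    return w

def tiled_apply(img, model, tile=512, stride=448):
    """Run `model` tile by tile and blend the overlaps with a tapered window.

    Assumes a 2D image that is at least one tile in each dimension.
    """
    overlap = tile - stride
    win = np.outer(taper_1d(tile, overlap), taper_1d(tile, overlap))
    canvas = np.zeros(img.shape, dtype=np.float64)
    weight = np.zeros(img.shape, dtype=np.float64)
    h, w = img.shape
    ys = sorted(set(list(range(0, h - tile, stride)) + [h - tile]))
    xs = sorted(set(list(range(0, w - tile, stride)) + [w - tile]))
    for y in ys:
        for x in xs:
            out = model(img[y:y + tile, x:x + tile])
            canvas[y:y + tile, x:x + tile] += out * win
            weight[y:y + tile, x:x + tile] += win
    return canvas / np.maximum(weight, 1e-8)

# Identity "model": with a correct blend, the output must equal the input.
rng = np.random.default_rng(0)
img = rng.random((1408, 1408))
out = tiled_apply(img, lambda t: t)
print(np.abs(out - img).max())
```

&lt;p&gt;The identity check is a cheap sanity test for the blending itself: if the window or the normalisation is wrong, the round trip stops matching the input, typically along the image border or the tile seams.&lt;/p&gt;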



&lt;p&gt;After switching to this, the seam difference dropped from 4.1 to 1.0, statistically indistinguishable from the 0.9 we measured on a normal column. Same model weights. Same GPU. Just honest about where each tile's information ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Catching it before customers do
&lt;/h2&gt;

&lt;p&gt;The annoying part was that nobody noticed the seam for a while because our eval set was mostly 1024px crops that never tiled. The artifact only existed at the resolution we did not test.&lt;/p&gt;

&lt;p&gt;So we built a regression check on full-size output. For each render we compute the per-column mean absolute gradient and flag any column whose value spikes above its neighbors by more than 3x at a known tile boundary. Cheap, deterministic, runs on CPU.&lt;/p&gt;
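&lt;p&gt;A rough sketch of such a check, assuming a greyscale array and a list of known boundary columns. The function name &lt;code&gt;seam_columns&lt;/code&gt;, the choice of three neighbouring columns on each side, and the synthetic test image are assumptions made for illustration.&lt;/p&gt;

```python
import numpy as np

def seam_columns(img, boundaries, ratio=3.0, eps=1e-6):
    """Return the tile-boundary columns whose gradient spikes above their neighbours."""
    # Mean absolute horizontal gradient; grad[i] is the step from column i to i+1.
    grad = np.abs(np.diff(img.astype(np.float64), axis=1)).mean(axis=0)
    flagged = []
    for x in boundaries:
        at_seam = grad[x - 1]                      # step from column x-1 into column x
        nearby = np.concatenate([grad[x - 4:x - 1], grad[x:x + 3]])
        if at_seam > ratio * (nearby.mean() + eps):
            flagged.append(x)
    return flagged

# Synthetic check: light noise everywhere, plus a 4-level step at x = 512 only.
rng = np.random.default_rng(0)
img = 128 + rng.normal(0.0, 0.5, size=(256, 1024))
img[:, 512:] += 4.0
print(seam_columns(img, boundaries=[448, 512]))
```

&lt;p&gt;Comparing against local neighbours rather than a fixed threshold keeps the check usable on both flat backgrounds and busy texture, where the baseline gradient differs by an order of magnitude.&lt;/p&gt;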

&lt;p&gt;For the fuzzier cases (texture seams, slight color drift) we run a vision-language model over a sample of outputs and ask it to describe any visible discontinuity. Those calls go through a gateway, Bifrost, which is one of a few ways we keep provider config and rate limits in one place instead of scattered across scripts. The numeric check catches the obvious ones; the VLM catches the ones a metric misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Seam MAD (8-bit)&lt;/th&gt;
&lt;th&gt;VRAM (4K)&lt;/th&gt;
&lt;th&gt;Extra compute&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single pass&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~31 GB (OOM on 24GB)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard tiles, no overlap&lt;/td&gt;
&lt;td&gt;4.1&lt;/td&gt;
&lt;td&gt;6 GB&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlap + flat average&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;td&gt;+14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlap + Hann window&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;7 GB&lt;/td&gt;
&lt;td&gt;+16%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Overlap is not free. A 64px overlap on 512px tiles means roughly 16% more pixels get processed, so throughput drops by about that much. Wider overlap blends better and costs more, and past ~96px we saw no further quality gain, only the bill.&lt;/p&gt;

&lt;p&gt;Hann windowing assumes the two predictions in the overlap are both reasonable and close. They usually are for this upscaler. For a diffusion model with stochastic sampling per tile they can diverge enough that blending produces a ghost, and you need a shared noise seed or latent-space tiling instead.&lt;/p&gt;

&lt;p&gt;This also does nothing for semantic seams, where two tiles hallucinate different details. Window blending fixes geometry and color continuity, not content disagreement. That is a harder problem and the honest answer is you tile in latent space or you do not tile at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1809.00219" rel="noopener noreferrer"&gt;ESRGAN: Enhanced Super-Resolution GANs&lt;/a&gt; — the architecture family this upscaler comes from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2302.08113" rel="noopener noreferrer"&gt;MultiDiffusion&lt;/a&gt; — fusing overlapping diffusion paths, the latent-space version of this idea&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2302.02412" rel="noopener noreferrer"&gt;Mixture of Diffusers&lt;/a&gt; — region-based blending for tiled generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/docs/stable/generated/torch.hann_window.html" rel="noopener noreferrer"&gt;PyTorch torch.hann_window docs&lt;/a&gt; — the window function used above&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; — the gateway we route eval-time VLM calls through&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>computervision</category>
      <category>pytorch</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Perplexity held flat after INT4. Task accuracy dropped 7 points.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:39:22 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/perplexity-held-flat-after-int4-task-accuracy-dropped-7-points-4fg6</link>
      <guid>https://dev.to/marcuswwchen/perplexity-held-flat-after-int4-task-accuracy-dropped-7-points-4fg6</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost shipped it. A domain eval suite caught a 7-point drop in multi-step task completion that perplexity never saw. Perplexity is a terrible acceptance gate for quantized models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run model fine-tuning and eval for enterprise agent automation at Nexus Labs. Series B, small team, ten people who touch the eval pipeline. The model in question was a Qwen2.5-14B fine-tune we use for structured workflow execution. Customer-facing. It matters when it's wrong.&lt;/p&gt;

&lt;p&gt;The plan was boring. Quantize to INT4 to fit two replicas on one A100 instead of one, cut serving cost roughly in half. Standard move. We picked GPTQ with a 128 group size, ran calibration on 512 samples from our training distribution, and measured perplexity before and after.&lt;/p&gt;

&lt;h2&gt;
  
  
  The number that lied
&lt;/h2&gt;

&lt;p&gt;Perplexity on our held-out set: 3.81 full precision, 3.85 after INT4. That's a 1% move. Nothing. By the old folklore, a quantization that holds perplexity is a quantization you ship.&lt;/p&gt;

&lt;p&gt;So we ran the actual eval suite. Not perplexity. The 340-case adversarial set we built for this product, where each case is a multi-step task with a programmatic pass/fail check on the final state.&lt;/p&gt;

&lt;p&gt;Task completion went from 81.2% to 74.1%. Seven points. On a metric customers feel directly.&lt;/p&gt;

&lt;p&gt;The failures clustered. Long sequences, six steps or more, where the model had to hold a constraint from step one and apply it at step five. The INT4 model dropped the constraint. Perplexity averages token-level surprise across the whole corpus, so a few critical tokens going wrong in a 400-token trajectory barely move the mean. The eval that scores the trajectory outcome sees it immediately.&lt;/p&gt;
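&lt;p&gt;The arithmetic is easy to reproduce. The sketch below uses made-up per-token losses, chosen only to land in the same range as our numbers: a 400-token trajectory at perplexity 3.81, then the same trajectory with two decisive tokens made much less likely.&lt;/p&gt;

```python
import math

TOKENS = 400
base_nll = math.log(3.81)            # per-token loss that yields perplexity 3.81

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

clean = [base_nll] * TOKENS
# Same trajectory, but two decisive tokens each become 2 nats more surprising
# (the correct token is roughly 7x less likely at those two positions).
damaged = list(clean)
damaged[40] += 2.0
damaged[310] += 2.0

print(round(perplexity(clean), 2), round(perplexity(damaged), 2))  # prints: 3.81 3.85
```

&lt;p&gt;Two tokens that may decide whether the task passes or fails shift the mean loss by 4 / 400 = 0.01 nats, which is about a 1% change in perplexity.&lt;/p&gt;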

&lt;p&gt;Here is roughly what we measured across the gates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;FP16&lt;/th&gt;
&lt;th&gt;INT4 (GPTQ)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Perplexity (held-out)&lt;/td&gt;
&lt;td&gt;3.81&lt;/td&gt;
&lt;td&gt;3.85&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMLU (5-shot)&lt;/td&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;70.9%&lt;/td&gt;
&lt;td&gt;-0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task completion (our suite)&lt;/td&gt;
&lt;td&gt;81.2%&lt;/td&gt;
&lt;td&gt;74.1%&lt;/td&gt;
&lt;td&gt;-7.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constraint-retention subset&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;td&gt;69%&lt;/td&gt;
&lt;td&gt;-19&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MMLU barely moved either. Generic benchmarks were as blind as perplexity here. The damage was concentrated in exactly the capability our product depends on, and only the domain suite measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why averaged metrics miss this
&lt;/h2&gt;

&lt;p&gt;Quantization error isn't uniform. INT4 rounds each weight to one of 16 levels, and the layers that handle long-range dependency, the attention projections deep in the stack, absorb the worst of that error. A model can stay fluent token-to-token while losing the thread across a long context. Fluency is what perplexity rewards. Following a constraint across 400 tokens is not fluency.&lt;/p&gt;
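&lt;p&gt;To see why the error can be uneven, here is a toy of plain round-to-nearest INT4 with one scale per group of 128 weights. This is deliberately not GPTQ, which adds error compensation on top. The synthetic weights and the helper &lt;code&gt;quantize_int4_groups&lt;/code&gt; are assumptions for illustration, and the toy says nothing about which layers of a real model are affected.&lt;/p&gt;

```python
import numpy as np

def quantize_int4_groups(w, group_size=128):
    """Symmetric round-to-nearest INT4 with one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # map max |w| to level 7
    scale = np.maximum(scale, 1e-12)                           # guard all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7)               # 16 integer levels
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
smooth = rng.normal(0.0, 0.02, size=4096)
spiky = smooth.copy()
spiky[::128] = 0.5            # one large outlier per group stretches that group's scale

errors = {}
for name, w in [("smooth", smooth), ("spiky", spiky)]:
    errors[name] = float(np.abs(quantize_int4_groups(w) - w).mean())
    print(name, errors[name])
```

&lt;p&gt;A single outlier per group forces a coarse scale on the other 127 weights, so the mean error in the spiky tensor is several times larger even though almost all of its values are identical to the smooth one.&lt;/p&gt;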

&lt;p&gt;The lesson we keep relearning: the model is the easy part. The thing that tells you whether the model is good enough is the hard part, and it's almost never a single scalar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We made the domain suite a hard gate for any inference-level change. Quantization, a vLLM version bump, a new kernel, all of it has to clear the trajectory eval, not perplexity.&lt;/p&gt;

&lt;p&gt;To get clean comparisons we shadow every eval case against two backends at once: the FP16 reference on one endpoint and the candidate INT4 build on another. We route both through Bifrost, our gateway, so the eval harness sends one OpenAI-format request and we fan it to both backends behind the same interface. That removed a class of bugs where prompt formatting drifted between the two test paths and made the diff look bigger than it was.&lt;/p&gt;

&lt;p&gt;The harness itself is dull on purpose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;GATEWAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initial_state&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GATEWAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# "ref/qwen-fp16" or "cand/qwen-int4"
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
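        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                  &lt;span class="c1"&gt;# do not score a gateway error body as a model reply
&lt;/span&gt;        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;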
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# programmatic pass/fail
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature 0, deterministic check, no LLM judging the output. The check is code that inspects final state. When the pass criterion is itself fuzzy, you can't tell a quantization regression from judge noise, and we'd already been burned by that.&lt;/p&gt;

&lt;p&gt;We didn't abandon INT4. We re-ran with AWQ instead of GPTQ and bumped calibration to 1,024 samples weighted toward long sequences. That landed at 79.3% task completion. Still down from FP16, but inside our 2-point tolerance, so we shipped it with the cost win mostly intact.&lt;/p&gt;
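&lt;p&gt;The gate itself reduces to a one-line comparison. A minimal sketch, with &lt;code&gt;passes_gate&lt;/code&gt; as a hypothetical name and the 2-point tolerance taken from the paragraph above:&lt;/p&gt;

```python
def passes_gate(reference_pct: float, candidate_pct: float, tolerance_pts: float = 2.0):
    """True when the candidate loses no more than `tolerance_pts` against the reference."""
    return tolerance_pts >= reference_pct - candidate_pct

FP16 = 81.2
print(passes_gate(FP16, 74.1))   # GPTQ build: prints False (down 7.1 points)
print(passes_gate(FP16, 79.3))   # AWQ build: prints True (down 1.9 points)
```

&lt;p&gt;Measuring the drop in absolute points against the FP16 reference, rather than against a fixed score, means the gate keeps working when the suite itself gets harder.&lt;/p&gt;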

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;A 340-case trajectory suite is expensive. Each full run takes about 11 minutes and burns real GPU time. Perplexity takes seconds. We can only afford the suite because we gate on it for releases, not on every commit.&lt;/p&gt;

&lt;p&gt;This finding is ours, not a law. A model serving short single-turn responses would likely show almost no gap between perplexity and task metrics, because there's no long-range constraint to lose. The wider the gap between your token-level proxy and your actual product behavior, the more this bites.&lt;/p&gt;

&lt;p&gt;Deterministic checks only work when success is checkable in code. Plenty of generation tasks aren't, and there you're stuck with judge models and their variance. We don't pretend INT4 is free either. It cost us 2 points we chose to pay for the throughput.&lt;/p&gt;

&lt;p&gt;And calibration data matters more than the algorithm. Switching GPTQ to AWQ helped, but reweighting calibration toward long sequences helped more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2210.17323" rel="noopener noreferrer"&gt;GPTQ paper (Frantar et al.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2306.00978" rel="noopener noreferrer"&gt;AWQ: Activation-aware Weight Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/quantization/supported_hardware.html" rel="noopener noreferrer"&gt;vLLM quantization docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/transformers/main/en/quantization/overview" rel="noopener noreferrer"&gt;Hugging Face quantization guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Speculative decoding shifted our output distribution and evals missed it</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 18 Jun 2026 06:31:41 +0000</pubDate>
      <link>https://dev.to/marcuswwchen/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it-4dci</link>
      <guid>https://dev.to/marcuswwchen/speculative-decoding-shifted-our-output-distribution-and-evals-missed-it-4dci</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B. Got a 1.9x throughput win. Three weeks later a customer flagged that the agent's tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.&lt;/p&gt;

&lt;p&gt;So we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.&lt;/p&gt;

&lt;p&gt;The throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Lossless" is doing a lot of work in that sentence
&lt;/h2&gt;

&lt;p&gt;The vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.&lt;/p&gt;

&lt;p&gt;Here is the thing nobody tells you. The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token-by-token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually.&lt;/p&gt;

&lt;p&gt;When the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool-call output where one token flips &lt;code&gt;"limit": 50&lt;/code&gt; to &lt;code&gt;"limit": 500&lt;/code&gt;, it is not invisible at all.&lt;/p&gt;
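
&lt;p&gt;A toy sketch of the mechanism, not of vLLM's kernels: it uses plain Python float64 on CPU rather than float16 on a GPU, and the three terms and the rival logit are made-up values chosen so the two reduction orders land on opposite sides of a tie.&lt;/p&gt;

```python
# Toy illustration only: float64 on CPU, not float16 GPU kernels.
# The terms and the rival logit are made-up values.
terms = [0.1, 0.2, 0.3]

# The same three terms, summed in two different orders.
sequential = (terms[0] + terms[1]) + terms[2]   # left-to-right
batched = terms[0] + (terms[1] + terms[2])      # a different grouping


def greedy_pick(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=logits.__getitem__)


rival = 0.6  # a competing token whose logit sits right at the boundary

pick_sequential = greedy_pick([rival, sequential])
pick_batched = greedy_pick([rival, batched])

print(sequential == batched)          # False: float addition is not associative
print(pick_sequential, pick_batched)  # the two paths choose different tokens
```

&lt;p&gt;The difference between the two sums is one unit in the last place. That is enough to flip a greedy pick when two candidates are effectively tied, and float16 has far fewer bits to spare than float64.&lt;/p&gt;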

&lt;p&gt;We measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Serving path&lt;/th&gt;
&lt;th&gt;Exact-match outputs&lt;/th&gt;
&lt;th&gt;Tool-arg mismatch&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target only (no spec)&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spec decode, 1B draft&lt;/td&gt;
&lt;td&gt;98.8%&lt;/td&gt;
&lt;td&gt;1.2%&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spec decode, 3B draft&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per-call divergence compounds. Over a 5-call session that's roughly a 6% chance at least one call drifts.&lt;/p&gt;
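
&lt;p&gt;The compounding figure can be checked with a few lines, assuming each call diverges independently with the same probability (a simplification, since real agent calls are correlated). The function name is ours, for illustration.&lt;/p&gt;

```python
def session_drift_probability(per_call_mismatch, calls):
    """Chance that at least one of `calls` independent calls diverges."""
    return 1.0 - (1.0 - per_call_mismatch) ** calls


# 1.2% is the per-call mismatch rate from the table above.
for calls in (1, 5, 10):
    p = session_drift_probability(0.012, calls)
    print(f"{calls:2d} calls: {p:.1%}")
```

&lt;p&gt;For 5 calls this gives about 5.9%, which is where the "roughly 6%" comes from.&lt;/p&gt;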

&lt;h2&gt;
  
  
  Why our evals slept through it
&lt;/h2&gt;

&lt;p&gt;This is the part I'm actually annoyed about. Our offline eval suite hit the model directly through the HF &lt;code&gt;generate()&lt;/code&gt; API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on.&lt;/p&gt;

&lt;p&gt;We were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn't testing the model we served.&lt;/p&gt;

&lt;p&gt;The fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI-compatible endpoint to point both at. The point isn't the tool. The point is your eval requests must traverse the identical decode path, kernels included.&lt;/p&gt;

&lt;p&gt;Here's the config flag that matters in vLLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm serving config&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/nexus-8b-toolcall&lt;/span&gt;
&lt;span class="na"&gt;speculative_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/nexus-1b-draft&lt;/span&gt;
  &lt;span class="na"&gt;num_speculative_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="c1"&gt;# this is the one we missed:&lt;/span&gt;
&lt;span class="c1"&gt;# disable_logprobs_during_spec_decoding defaults vary by version.&lt;/span&gt;
&lt;span class="c1"&gt;# pin it and assert it in CI.&lt;/span&gt;
&lt;span class="na"&gt;speculative_disable_logprobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the eval-side assertion we added so this never ships silently again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fail CI if eval path != serving path
&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nexus-8b-toolcall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec_decode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# must match prod
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_fingerprint&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;EXPECTED_FINGERPRINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decode path drift: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_fingerprint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.&lt;/p&gt;
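
&lt;p&gt;One way such a fingerprint could be computed is sketched below. This is an assumption about the shape of the idea, not our production code: the field names (&lt;code&gt;draft_model_sha&lt;/code&gt;, &lt;code&gt;runtime_version&lt;/code&gt;) and their values are hypothetical, and the only real APIs used are &lt;code&gt;json.dumps&lt;/code&gt; and &lt;code&gt;hashlib.sha256&lt;/code&gt; from the standard library.&lt;/p&gt;

```python
import hashlib
import json


def serving_fingerprint(config):
    """Short, stable SHA-256 digest of a serving config dict."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical field names and values, for illustration only.
prod = {
    "draft_model_sha": "abc123",
    "num_speculative_tokens": 5,
    "runtime_version": "0.0.0",
}
swapped_draft = dict(prod, draft_model_sha="def456")

print(serving_fingerprint(prod) == serving_fingerprint(dict(prod)))     # True
print(serving_fingerprint(prod) == serving_fingerprint(swapped_draft))  # False
```

&lt;p&gt;Hashing a canonical JSON form keeps the fingerprint stable across key reordering, so it changes only when a hashed value changes.&lt;/p&gt;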

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.&lt;/p&gt;

&lt;p&gt;First, we raised the bar on tool-call endpoints specifically. For the two customers running financial workflows, we run target-only, no draft. Slower, exact. They opted in to the cost.&lt;/p&gt;

&lt;p&gt;Second, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.&lt;/p&gt;

&lt;p&gt;Third, all eval traffic now routes through the production endpoint. No more &lt;code&gt;generate()&lt;/code&gt; in the harness. If the serving path changes, the eval changes with it.&lt;/p&gt;
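
&lt;p&gt;The comparison at the heart of the nightly canary from the second change can be sketched as follows. The function names are hypothetical, and the sketch assumes the outputs from both serving paths have already been collected as lists of strings in the same prompt order; fetching them from the endpoints is left out.&lt;/p&gt;

```python
def mismatch_rate(reference_outputs, candidate_outputs):
    """Fraction of prompts whose outputs differ between two serving paths."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both paths must replay the same prompts")
    if not reference_outputs:
        raise ValueError("need at least one prompt")
    differing = sum(
        1 for ref, cand in zip(reference_outputs, candidate_outputs) if ref != cand
    )
    return differing / len(reference_outputs)


def canary_should_alert(reference_outputs, candidate_outputs, threshold=0.015):
    """True when the mismatch rate exceeds the alert threshold (1.5% here)."""
    return mismatch_rate(reference_outputs, candidate_outputs) > threshold


# Toy data: 500 replayed prompts, 6 of which diverge (1.2%).
target_only = ["ok"] * 500
spec_decode = ["drift"] * 6 + ["ok"] * 494
print(mismatch_rate(target_only, spec_decode))        # 0.012
print(canary_should_alert(target_only, spec_decode))  # False
```

&lt;p&gt;Exact string comparison is deliberately strict. For tool calls, comparing the parsed arguments instead would ignore whitespace-only differences.&lt;/p&gt;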

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;This costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. That is correct, but it means "the model regressed" and "the runtime changed" now look the same on the dashboard. You need the fingerprint to tell them apart.&lt;/p&gt;

&lt;p&gt;The fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.&lt;/p&gt;

&lt;p&gt;Target-only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and more draft compute per step, which is why the 3B draft dropped us from 78 to 64 tokens/sec. So 3B was not a free win either.&lt;/p&gt;

&lt;p&gt;And 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/features/spec_decode.html" rel="noopener noreferrer"&gt;vLLM speculative decoding docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2211.17192" rel="noopener noreferrer"&gt;Leviathan et al., "Fast Inference from Transformers via Speculative Decoding"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vllm-project/vllm/issues" rel="noopener noreferrer"&gt;vLLM GitHub issues on greedy determinism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/notes/randomness.html" rel="noopener noreferrer"&gt;PyTorch numerical reproducibility notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Developer Take On: A High-Resolution Neural Cellular Automata</title>
      <dc:creator>Kelvin Kariuki</dc:creator>
      <pubDate>Wed, 17 Jun 2026 11:53:56 +0000</pubDate>
      <link>https://dev.to/kelvin_kariuki_20f4bec616/developer-take-on-a-high-resolution-neural-cellular-automata-111g</link>
      <guid>https://dev.to/kelvin_kariuki_20f4bec616/developer-take-on-a-high-resolution-neural-cellular-automata-111g</guid>
      <description>&lt;h1&gt;
  
  
  Developer Take On: A High-Resolution Neural Cellular Automata
&lt;/h1&gt;

&lt;p&gt;Art has always been a fusion of creativity and mathematics, with each playing off the other to produce breathtaking works. With the advent of machine learning, the line between art and mathematics has further blurred, allowing us to generate stunning visuals that were previously unimaginable. Cellular automata, a mathematical concept first introduced by von Neumann in the 1940s, have been a staple in the world of artificial life and fractal generation. In this article, we'll dive into the world of high-resolution neural cellular automata, exploring the concept, its applications, and implementing it in Python using the PyTorch library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cellular Automata 101
&lt;/h2&gt;

&lt;p&gt;Before we plunge into the world of neural cellular automata, let's quickly cover the basics of cellular automata. In essence, a cellular automaton is a grid of identical cells, each of which can change its state based on a set of predefined rules. These rules are applied simultaneously to all cells, resulting in a global update of the grid in each time step. This process is repeated iteratively, generating a sequence of grids that represent the evolution of the system.&lt;/p&gt;

&lt;p&gt;One of the most well-known examples of a cellular automaton is Conway's Game of Life, in which cells are either alive (1) or dead (0). The rules for updating the grid are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any live cell with two or three live neighbors survives.&lt;/li&gt;
&lt;li&gt;Any dead cell with three live neighbors becomes a live cell.&lt;/li&gt;
&lt;li&gt;All other live cells die in the next generation. Similarly, all other dead cells stay dead.&lt;/li&gt;
&lt;/ul&gt;
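
&lt;p&gt;The three rules above fit in a few lines of plain Python. This is a minimal sketch that stores the grid as a set of live &lt;code&gt;(row, column)&lt;/code&gt; coordinates on an unbounded plane; the function name &lt;code&gt;life_step&lt;/code&gt; is ours.&lt;/p&gt;

```python
from collections import Counter


def life_step(live):
    """Apply one Game of Life generation to a set of live (row, col) cells."""
    # Count, for every cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# A horizontal "blinker" flips to vertical and back again.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(life_step(blinker)))             # [(0, 1), (1, 1), (2, 1)]
print(life_step(life_step(blinker)) == blinker)  # True
```

&lt;p&gt;Every cell is updated from the same snapshot of the grid, which is the "applied simultaneously" property described earlier.&lt;/p&gt;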

&lt;p&gt;The resulting patterns created by cellular automata can be stunningly beautiful and display complex behavior, making them an attractive field of study for scientists and artists alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Neural Cellular Automata?
&lt;/h2&gt;

&lt;p&gt;Neural cellular automata (NCA) is an extension of traditional cellular automata, in which the rules governing the evolution of the grid are learned from a dataset using a neural network. This allows the NCA to automatically discover complex patterns and relationships in the data, resulting in visually striking and often surreal images.&lt;/p&gt;

&lt;p&gt;In essence, the NCA uses a neural network to predict the next state of each cell in the grid based on its current state and the states of its neighboring cells. This prediction is then used to update the grid, resulting in a sequence of grids that represent the evolution of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Resolution Neural Cellular Automata
&lt;/h2&gt;

&lt;p&gt;The primary challenge in generating high-resolution NCA images lies in training a deep neural network to accurately predict the next state of each cell in the grid. As the resolution of the grid increases, the number of possible states and transitions between them grows exponentially, making it increasingly difficult for the network to generalize and apply the learned rules.&lt;/p&gt;

&lt;p&gt;To overcome this challenge, we'll employ a technique we'll call "pixel shuffle" here: downsample the input grid to a lower resolution, train the network to predict the next state of each cell in the downscaled grid, and, once the network is trained, upsample its output to the desired resolution. Note that this is not the same operation as PyTorch's &lt;code&gt;nn.PixelShuffle&lt;/code&gt; layer, which rearranges a tensor of shape (N, C×r², H, W) into (N, C, H×r, W×r).&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing High-Resolution NCA in PyTorch
&lt;/h2&gt;

&lt;p&gt;Below is a simplified example of how we can implement a high-resolution NCA using the PyTorch library. We'll use a simple 3x3 convolutional neural network to learn the rules governing the evolution of the grid, and apply the pixel shuffle technique to generate high-resolution images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.optim&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;optim&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Define the update rule: a 3x3 convolution lets each cell see its neighbors
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# No pooling: the grid keeps its size, so the rule can be applied repeatedly
&lt;/span&gt;        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the dataset
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NCA_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__getitem__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__len__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the model, optimizer, and training data (random placeholders; target matches the output shape)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NCA&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the training dataset and data loader
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NCA_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Train the model
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_loader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate an image: apply the learned rule repeatedly, then upsample the result
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale_factor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nearest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simplified example, and in practice, you may need to adjust the architecture of the model and the training parameters to suit your specific use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we explored the concept of neural cellular automata and walked through a simplified NCA in PyTorch that learns its update rule at low resolution and upsamples the result. The example trains on random tensors, so producing visually interesting images requires real target images and some tuning. This is a highly active area of research, with a wide range of potential applications from art to scientific visualization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt;: A popular deep learning framework for Python.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.digitalocean.com/" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt;: A cloud platform for deploying and scaling applications.&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>Winograd convolutions cost us 2 mAP and we didn't notice for a month</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Wed, 17 Jun 2026 07:22:23 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/winograd-convolutions-cost-us-2-map-and-we-didnt-notice-for-a-month-1b3e</link>
      <guid>https://dev.to/marcorinaldi_ai/winograd-convolutions-cost-us-2-map-and-we-didnt-notice-for-a-month-1b3e</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We turned on Winograd convolution to shave latency off a pedestrian detector running on a Cortex-A53, got a clean 18% speedup, and silently lost 2.1 mAP because the F(4,3) transform overflowed in fp16. The accuracy drop hid inside our aggregate metric for almost a month before a per-distance breakdown caught it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, Winograd convolution is one of those optimisations that looks free. You replace the direct 3x3 convolution with a set of input transforms, elementwise multiplies, and an output transform, and the multiply count drops. For F(2x2, 3x3), the smallest tiling, you go from 36 multiplies per output tile down to 16, a 2.25x reduction. For F(4x4, 3x3), the tiling we used and call F(4,3) from here on, it is 144 down to 36. On paper that's a 4x reduction in multiplies for your 3x3 layers, before counting the transform overhead, and 3x3 is most of a modern backbone.&lt;/p&gt;
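
&lt;p&gt;The multiply counts follow directly from the tile sizes. As a rough sketch, here is the standard counting for the 2D minimal algorithm F(m x m, r x r); the helper name &lt;code&gt;multiplies_per_tile&lt;/code&gt; is made up for illustration, and the counts ignore the additions spent in the transforms.&lt;/p&gt;

```python
def multiplies_per_tile(m, r=3):
    """Multiplies for one m x m output tile of an r x r convolution.

    Returns (direct, winograd) for the 2D minimal algorithm F(m x m, r x r).
    Transform additions are not counted.
    """
    direct = (m * m) * (r * r)     # every output pixel needs r*r multiplies
    winograd = (m + r - 1) ** 2    # one multiply per transformed input element
    return direct, winograd


for m in (2, 4):
    direct, winograd = multiplies_per_tile(m)
    print(f"F({m}x{m}, 3x3): {direct} direct, {winograd} Winograd, "
          f"{direct / winograd:.2f}x fewer multiplies")
```

&lt;p&gt;Real speedups land well below these ratios, because the transforms and memory traffic are not free.&lt;/p&gt;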

&lt;p&gt;We run a small detector on a Cortex-A53 board for an indoor people-counting product, MobileNetV3 backbone, roughly 4.2M params after pruning. The team is three CV engineers and one firmware person. We had a 41ms inference budget and were sitting at 39ms, which is the kind of margin that keeps you up at night.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we turned on
&lt;/h2&gt;

&lt;p&gt;Our runtime exposes Winograd as a per-layer flag. We flipped it on for every 3x3 stride-1 layer, rebuilt, and measured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# before&lt;/span&gt;
./bench &lt;span class="nt"&gt;--model&lt;/span&gt; det_v3.onnx &lt;span class="nt"&gt;--conv-algo&lt;/span&gt; direct
&lt;span class="c"&gt;# mean 39.1ms  p99 44.0ms&lt;/span&gt;

&lt;span class="c"&gt;# after&lt;/span&gt;
./bench &lt;span class="nt"&gt;--model&lt;/span&gt; det_v3.onnx &lt;span class="nt"&gt;--conv-algo&lt;/span&gt; winograd-f43 &lt;span class="nt"&gt;--precision&lt;/span&gt; fp16
&lt;span class="c"&gt;# mean 32.0ms  p99 35.8ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;18% off the mean, p99 comfortably under budget. We shipped it. Espresso, done, on to the next ticket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it went wrong
&lt;/h2&gt;

&lt;p&gt;The detector's overall mAP on our validation set moved from 0.612 to 0.608. Four thousandths. That's inside the noise we normally see between training runs, so nobody blinked. We pin our eval against a fixed 3,800-image set and a 0.004 wobble is genuinely not signal most days.&lt;/p&gt;

&lt;p&gt;The problem only showed up when a customer reported that the counter undercounted in a large open atrium. People far from the camera, small in the frame, were getting dropped. When we broke mAP down by object size instead of looking at the single number, the picture was ugly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Object size (px)&lt;/th&gt;
&lt;th&gt;mAP direct&lt;/th&gt;
&lt;th&gt;mAP Winograd fp16&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;large (&amp;gt;96)&lt;/td&gt;
&lt;td&gt;0.781&lt;/td&gt;
&lt;td&gt;0.779&lt;/td&gt;
&lt;td&gt;-0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;medium (32-96)&lt;/td&gt;
&lt;td&gt;0.644&lt;/td&gt;
&lt;td&gt;0.631&lt;/td&gt;
&lt;td&gt;-0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;small (&amp;lt;32)&lt;/td&gt;
&lt;td&gt;0.402&lt;/td&gt;
&lt;td&gt;0.331&lt;/td&gt;
&lt;td&gt;-0.071&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Small objects lost 7 points. They're a minority of the boxes, so the aggregate barely moved, but for a people counter in a big room they're the whole game.&lt;/p&gt;
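
&lt;p&gt;For anyone who wants to try the same breakdown, a minimal sketch of size bucketing might look like the following. It is not our eval code: it assumes "size" means the square root of the box area (COCO-style), reuses the 32 and 96 pixel thresholds from the table, and reports per-bucket recall as a simple stand-in for per-bucket mAP, which your eval toolkit would normally compute. The function names and the sample boxes are invented.&lt;/p&gt;

```python
import math
from collections import defaultdict


def size_bucket(width, height):
    """Bucket a box by the square root of its area, in pixels (assumed convention)."""
    side = math.sqrt(width * height)
    if side > 96:
        return "large"
    if side >= 32:
        return "medium"
    return "small"


def recall_by_size(ground_truth):
    """ground_truth: iterable of (width, height, matched) tuples.

    `matched` is True when some detection was matched to that box.
    Returns recall per size bucket.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for width, height, matched in ground_truth:
        bucket = size_bucket(width, height)
        totals[bucket] += 1
        hits[bucket] += int(matched)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up boxes: one large, one medium, two small (one of them missed).
boxes = [(120, 200, True), (40, 80, True), (20, 25, False), (18, 30, True)]
print(recall_by_size(boxes))
```

&lt;p&gt;The point of the sketch is only the grouping step: once every ground-truth box carries a bucket label, any metric can be reported per bucket instead of as one number.&lt;/p&gt;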

&lt;h2&gt;
  
  
  Why Winograd ate the small boxes
&lt;/h2&gt;

&lt;p&gt;The F(4,3) transform matrices have entries that are not just 0, 1, and -1. In the standard Lavin &amp;amp; Gray formulation the input transform has entries as large as 5, the filter transform has fractions like 1/6 and 1/24, and the output transform goes up to 8, so the intermediate values span a wider dynamic range than a direct convolution does. In fp32 this is fine. In fp16, with a 10-bit mantissa, the input transform scales activations up, every intermediate gets rounded, and the output transform then has to subtract large, nearly equal terms to recover a small result. Catastrophic cancellation. The features that survive are the high-contrast ones, which correspond to large, well-lit objects. The faint gradient that says "small person at the back of the room" gets rounded into mush.&lt;/p&gt;

&lt;p&gt;We confirmed it by running the exact same weights with Winograd in fp32. Small-object mAP came back to 0.398, basically the direct number. The algorithm wasn't wrong. The algorithm in half precision was wrong for our data.&lt;/p&gt;
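
&lt;p&gt;The effect can be reproduced in miniature without any runtime at all. The sketch below is an illustration under stated assumptions, not our kernel: it uses the 1D F(4,3) matrices from Lavin &amp;amp; Gray, made-up activations and filter taps, and emulates fp16 by rounding after every multiply and add with the &lt;code&gt;struct&lt;/code&gt; module's half-precision format. Real kernels may accumulate in higher precision, so the numbers are only indicative.&lt;/p&gt;

```python
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def exact(x):
    """No extra rounding: plain Python floats (fp64)."""
    return x


def dot(row, vec, rnd):
    """Dot product with rounding applied after every multiply and add."""
    acc = 0.0
    for a, b in zip(row, vec):
        acc = rnd(acc + rnd(a * b))
    return acc


def matvec(mat, vec, rnd):
    return [dot(row, vec, rnd) for row in mat]


# 1D F(4,3) transform matrices (Lavin and Gray).
BT = [[4, 0, -5, 0, 1, 0],
      [0, -4, -4, 1, 1, 0],
      [0, 4, -4, -1, 1, 0],
      [0, -2, -1, 2, 1, 0],
      [0, 2, -1, -2, 1, 0],
      [0, 4, 0, -5, 0, 1]]
G = [[1 / 4, 0, 0],
     [-1 / 6, -1 / 6, -1 / 6],
     [-1 / 6, 1 / 6, -1 / 6],
     [1 / 24, 1 / 12, 1 / 6],
     [1 / 24, -1 / 12, 1 / 6],
     [0, 0, 1]]
AT = [[1, 1, 1, 1, 1, 0],
      [0, 1, -1, 2, -2, 0],
      [0, 1, 1, 4, 4, 0],
      [0, 1, -1, 8, -8, 1]]


def direct(d, g, rnd):
    """Four outputs of a 3-tap correlation over six inputs."""
    return [dot(d[i:i + 3], g, rnd) for i in range(4)]


def winograd_f43(d, g, rnd):
    u = matvec(G, g, rnd)                    # filter transform
    v = matvec(BT, d, rnd)                   # input transform
    m = [rnd(a * b) for a, b in zip(u, v)]   # six elementwise multiplies
    return matvec(AT, m, rnd)                # output transform


def max_abs_error(got, want):
    return max(abs(a - b) for a, b in zip(got, want))


d = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78]  # made-up, low-contrast activations
g = [0.25, -0.5, 0.26]                    # made-up filter taps
reference = direct(d, g, exact)
print("direct fp16 error:  ", max_abs_error(direct(d, g, to_fp16), reference))
print("winograd fp16 error:", max_abs_error(winograd_f43(d, g, to_fp16), reference))
```

&lt;p&gt;Passing &lt;code&gt;exact&lt;/code&gt; instead of &lt;code&gt;to_fp16&lt;/code&gt; to &lt;code&gt;winograd_f43&lt;/code&gt; gives the same answer as the direct path up to fp64 rounding, which mirrors the fp32 check above. The exact errors depend on the inputs, so treat any single run as an illustration rather than a measurement.&lt;/p&gt;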

&lt;h2&gt;
  
  
  What we actually did
&lt;/h2&gt;

&lt;p&gt;We did not throw Winograd away. We made it selective. The early layers, where the spatial resolution is high and small-object information lives, stayed on direct fp16. The deeper layers, lower resolution and more channels, kept Winograd. That recovered most of the speed without the accuracy hole.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;conv_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;winograd-f43&lt;/span&gt;
  &lt;span class="na"&gt;precision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fp16&lt;/span&gt;
  &lt;span class="na"&gt;overrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# high-res early stages carry small-object signal&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;layers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stem"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage1.*"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage2.0"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;algo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;direct&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
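
&lt;p&gt;To make the policy concrete, here is a minimal sketch of how such a config could be resolved per layer. This is not our runtime's code: it assumes the layer patterns are shell-style globs matched with Python's &lt;code&gt;fnmatch&lt;/code&gt;, that the first matching override wins, and the names &lt;code&gt;CONV_POLICY&lt;/code&gt; and &lt;code&gt;resolve_algo&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

# The YAML policy above, written out as a plain dict.
CONV_POLICY = {
    "default": "winograd-f43",
    "precision": "fp16",
    "overrides": [
        # high-res early stages carry small-object signal
        {"layers": ["stem", "stage1.*", "stage2.0"], "algo": "direct"},
    ],
}


def resolve_algo(layer_name, policy=CONV_POLICY):
    """Return the conv algorithm for a layer; the first matching override wins."""
    for override in policy["overrides"]:
        if any(fnmatchcase(layer_name, pattern) for pattern in override["layers"]):
            return override["algo"]
    return policy["default"]


for name in ("stem", "stage1.3", "stage2.0", "stage2.1", "stage4.2"):
    print(name, resolve_algo(name))
```

&lt;p&gt;Under those assumptions, the stem, every stage1 layer, and stage2.0 resolve to direct, and everything else falls through to the Winograd default.&lt;/p&gt;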



&lt;p&gt;End result: 34.6ms mean, small-object mAP at 0.395. We gave back about 2.6ms versus full Winograd and bought back 6.4 points where it mattered.&lt;/p&gt;

&lt;p&gt;One side note on validation. To trust the size-bucketed numbers we needed clean ground truth on a fresh holdout, and hand-labelling small distant figures is miserable and inconsistent between annotators. We auto-labelled a 600-image holdout with a VLM and had humans only correct it, routing those calls through Bifrost so we could fail over between two providers when one rate-limited us mid-batch. It was one option among a few; the point is the labels were consistent enough to make the per-bucket deltas believable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;This is not a "Winograd bad" post. F(4,3) in fp16 is a perfectly good default for a lot of models, and for a classifier where you only care about top-1 it would probably have been invisible and harmless.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The fix is model- and data-specific. Our small-object sensitivity is what made the fp16 cancellation matter. Your failure mode might be somewhere else entirely.&lt;/li&gt;
&lt;li&gt;Selective per-layer policy adds config surface. Someone has to remember why stage1 is direct, and that comment in the YAML is the only thing standing between you and a future regression.&lt;/li&gt;
&lt;li&gt;We never tried Winograd F(2,3), which has tamer transform coefficients and less numerical risk, at the cost of a smaller MAC reduction. That's the next thing to benchmark.&lt;/li&gt;
&lt;li&gt;The real lesson is about the metric, not the kernel. A single aggregate number hid a 7-point hole for weeks. Bucket your eval by the dimension your product actually cares about.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1509.09308" rel="noopener noreferrer"&gt;Fast Algorithms for Convolutional Neural Networks (Lavin &amp;amp; Gray, 2016)&lt;/a&gt; — the original Winograd-for-CNNs paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1803.10986" rel="noopener noreferrer"&gt;Error analysis of Winograd transforms&lt;/a&gt; — numerical stability and why fp16 hurts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html" rel="noopener noreferrer"&gt;NVIDIA mixed-precision training guide&lt;/a&gt; — dynamic range and fp16 pitfalls&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; — what we used for failover on the relabelling batch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cocodataset.org/#detection-eval" rel="noopener noreferrer"&gt;COCO evaluation metrics&lt;/a&gt; — the size-bucket breakdown we should have looked at sooner&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
