DEV Community: pytorch

Debugging Score-P with PyTorch DDP: A Field Guide to CUDA Error 802 and Other Surprises

Paramita Choudhury — Mon, 29 Jun 2026 05:17:28 +0000

When I set out to instrument my multi-GPU DNABERT-2 training runs with Score-P to analyse DDP communication overhead, I expected the hard part to be understanding the traces. Instead, the hard part turned out to be getting Score-P to coexist with PyTorch's torchrun-based DDP launch mechanism at all.

This post documents every error I hit and exactly how I fixed each one - in the hope that the next person trying to trace a PyTorch DDP workload with Score-P doesn't spend two days rediscovering the same root causes.

The setup: DNABERT-2 (117M-parameter genomic transformer), PyTorch 2.1.2, Score-P 8.1 with Python bindings, a SLURM cluster with A100-SXM4-40GB GPUs, 1/4/8-GPU configurations.

A companion post — Where does the time really go in multi-GPU training? — covers what the traces actually revealed once they worked. This post is purely the war stories of getting there.

Background: two things Score-P does that fight PyTorch DDP

The re-exec mechanism. When you run python -m scorep train.py, Score-P does not simply import itself and start tracing. It sets environment variables (including LD_PRELOAD) to load its C measurement library, then re-executes the entire Python process from scratch with those variables in place. Your script effectively starts twice: once as the launcher, once as the instrumented process. Anything that happens before the re-exec - including CUDA initialisation - happens in a process that then exits.
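The re-exec pattern itself is easy to sketch in Python. This is not Score-P's actual launcher code: the marker variable DEMO_REEXEC_DONE and the library path passed in are hypothetical, and the sketch only illustrates why anything done before the re-exec is thrown away.

```python
import os
import sys

MARKER = "DEMO_REEXEC_DONE"  # hypothetical marker, not a real Score-P variable


def build_reexec_env(environ, preload_lib):
    """Return a copy of the environment with the preload library and marker set."""
    env = dict(environ)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{preload_lib}:{existing}" if existing else preload_lib
    env[MARKER] = "1"
    return env


def maybe_reexec(preload_lib):
    """First pass: replace this process with a fresh interpreter. Second pass: return."""
    if os.environ.get(MARKER) == "1":
        return  # already running as the re-executed, instrumented process
    env = build_reexec_env(os.environ, preload_lib)
    # os.execve never returns: state built so far (imported modules, open
    # handles, an initialised CUDA context) is discarded with the old image.
    os.execve(sys.executable, [sys.executable] + sys.argv, env)
```

The point of the sketch is the ordering: whatever runs before maybe_reexec() happens in the launcher pass only, and the instrumented pass starts again from the top of the script with the preload already in place.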

The two-layer CUDA model. CUDA has two separate APIs:

The driver API, used by nvidia-smi, nvmlInit(), etc. - what the kernel module exposes.
The runtime API, used by cudaGetDeviceCount(), torch.cuda.is_available(), torch.zeros(1).cuda() - what PyTorch and Score-P's CUDA adapter both use.

On a freshly allocated SLURM job, the driver API can respond immediately while the runtime API is still initialising - sometimes for tens of seconds. Score-P's C CUDA adapter, loaded via LD_PRELOAD at library-load time, probes the runtime API the moment the process starts. If the runtime isn't ready yet, the probe poisons the CUDA context for that process permanently.

With that context, here are the errors.

Error 1: FP16 ValueError - a hidden CUDA Error 802

Symptom. Training crashed immediately with:

ValueError: FP16 Mixed precision training with AMP or APEX ('--fp16') can only
be used on CUDA devices.

This looked like a config error - I was clearly on a GPU node, nvidia-smi showed four A100s, yet PyTorch claimed no CUDA device.

Real cause. Score-P's C CUDA adapter was loaded via LD_PRELOAD at startup, before PyTorch initialised CUDA. The adapter called cudaGetDeviceCount() while the runtime was still in the cudaErrorSystemNotReady (Error 802) state. That left the CUDA context permanently broken for the process; the later torch.cuda.is_available() returned False, and the FP16 ValueError was just a downstream symptom.

Fix direction. The CUDA runtime must be warmed up - forced to fully initialise - before python -m scorep sets LD_PRELOAD. Once Score-P's C library is loaded, it's too late.

Error 2: the nvidia-smi check was the wrong layer

First attempt. A pre-flight check that polled nvidia-smi until it responded:

for attempt in $(seq 1 10); do
    if nvidia-smi > /dev/null 2>&1 && [ "$ngpus_visible" -eq "$ngpus_expected" ]; then
        echo "CUDA ready after $attempt attempt(s)."
        break
    fi
    sleep 3
done

Why it wasn't enough. nvidia-smi uses the driver API. A successful call only proves the kernel module is responding - it says nothing about whether cudaGetDeviceCount() would succeed. On a fresh job, nvidia-smi passes on attempt 1 while the runtime API is still cudaErrorSystemNotReady. Wrong layer.

Error 3: `assert torch.cuda.is_available()` fails immediately

Second attempt. A Python check inside the warmup loop:

python -c "import torch; assert torch.cuda.is_available(); torch.zeros(1).cuda()"

Why it failed. On a cold node, torch.cuda.is_available() does not raise when the runtime is not ready; it simply returns False. The assert then fails and the check exits on the very first attempt, before the runtime has had time to initialise.

Fix. Drop the assert. Call torch.zeros(1).cuda() directly inside try/except - let the exception be the "not ready yet" signal:

for attempt in $(seq 1 30); do
    if python -c "
import torch, sys
try:
    torch.zeros(1).cuda()
except Exception:
    sys.exit(1)
" 2>/dev/null; then
        echo "CUDA runtime ready after $attempt attempt(s)."
        break
    fi
    echo "CUDA runtime not ready (attempt $attempt/30), sleeping 10s..."
    sleep 10
done

This warmup runs without Score-P active - no LD_PRELOAD, no CUDA adapter - so it forces the runtime to initialise once. Every later process (including the Score-P-instrumented workers) then finds the runtime already warm.

Error 4: a node with a permanently broken CUDA runtime

Symptom. Even with the 30-attempt warmup (5 minutes), all attempts failed on one particular node, while nvidia-smi passed on attempt 1 every time.

Diagnosis. That node had a broken CUDA runtime install: the driver was fine, but cudaGetDeviceCount() never returned. A sysadmin problem, not an application one - and the scheduler kept landing my jobs there because it was first in the queue.

Fix. Exclude the bad node in the SLURM script:

#SBATCH --exclude=<broken_node>

After that, jobs landed on healthy nodes where both checks passed on attempt 1.

Error 5: the `scorep.user` import in DDP worker processes

Symptom. After fixing the node, 4- and 8-GPU runs still failed with Error 802 - this time in the worker processes spawned by torchrun, not the main process.

Cause. I had imported scorep.user at module level:

import scorep.user  # module-level import

When torchrun spawns one worker per GPU, each worker re-imports the module, and import scorep.user triggers Score-P's CUDA adapter init inside each freshly-spawned subprocess - before PyTorch sets up that worker's CUDA context. The parent-shell warmup does not carry into child processes.

Fix. Lazy import: defer import scorep.user until the first training_step(), by which point PyTorch has initialised CUDA for that worker:

import transformers

# Module level - scorep.user is deliberately not imported yet
_scorep_user = None  # None = not yet tried; False = import failed

class ScorePTrainer(transformers.Trainer):
    def training_step(self, model, inputs):
        global _scorep_user
        if _scorep_user is None:
            # First step in this process: CUDA is already set up for this rank
            try:
                import scorep.user
                _scorep_user = scorep.user
            except ImportError:
                _scorep_user = False
        if _scorep_user:
            _scorep_user.region_begin("dnabert_train_step")
        try:
            return super().training_step(model, inputs)
        finally:
            # Close the region even if the step raises
            if _scorep_user:
                _scorep_user.region_end("dnabert_train_step")

The None guard means the import is attempted exactly once per process, on the first step - after CUDA is ready for that rank.

Error 6: Score-P memory limit exceeded

Symptom. With CUDA kernel tracing on (SCOREP_CUDA_ENABLE=kernel,memcpy,sync):

[Score-P] Warning: Too many memory requested. Score-P supports only up to,
but not including, 4 GiB of total memory per process. Reducing to its maximum value.

I had set SCOREP_TOTAL_MEMORY=4G. Score-P's hard per-process limit is strictly less than 4 GiB - exactly 4G hits the ceiling.

Fix. SCOREP_TOTAL_MEMORY=3500M - under the cap, with room for CUDA kernel traces across 8 ranks.
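A small pre-flight check can catch this before the job is submitted. The sketch below assumes the K/M/G suffixes are binary multiples (1024-based), which is consistent with 4G hitting the ceiling; the helper names are my own, not part of Score-P.

```python
FOUR_GIB = 4 * 1024**3  # the per-process ceiling quoted in the warning

# Assumption: size suffixes are 1024-based.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a value such as '3500M' or '4G' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def fits_scorep_limit(text):
    """True if the value is strictly below 4 GiB, as the warning requires."""
    return parse_size(text) < FOUR_GIB


for value in ("4G", "3500M"):
    print(value, parse_size(value), fits_scorep_limit(value))
```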

Error 7: `load_best_model_at_end` strategy conflict

Symptom. The short 50-step trace runs failed instantly:

ValueError: --load_best_model_at_end requires the save and eval strategy to match,
but found Evaluation strategy: NO / Save strategy: STEPS

I'd set --evaluation_strategy no to keep eval passes from distorting the trace timeline, but load_best_model_at_end=True requires matching save/eval strategies.

Fix. There is no "best model" for a 50-step diagnostic run:

--evaluation_strategy no \
--load_best_model_at_end False \
--save_steps 10000 \

Error 8: the trace contained the launcher, not the workers

Symptom. The runs completed, produced an OTF2 trace, and opened cleanly in Vampir - showing a single red bar: ...LocalElasticAgent._invoke_run. No kernels. No NCCL. No training steps. The process filter listed exactly one process.

Cause. The launcher was python -m scorep .../scorep_torchrun.py, where scorep_torchrun.py is just from torch.distributed.run import main; main() - i.e. plain torchrun. Its elastic agent spawns the GPU workers as separate child processes that start fresh python interpreters with no Score-P. Score-P therefore instrumented only the agent - the babysitter - which spends the whole run waiting. On disk the proof was unambiguous: the entire 8-GPU trace held a single process's events, and scorep-score showed only Python (USR) regions - no CUDA type:

$ ls traces/
0.def   0.evt                          # one process, not eight

$ scorep-score profile.cubex
flt  type  max_buf[B]   visits  time[s] time[%]  region
     ALL  16,266,573  625,628   38.47   100.0   ALL
     USR  16,266,302  625,627   38.04    98.9   USR     ← all Python, the agent waiting
  SCOREP        271        1    0.43     1.1   SCOREP

Fix. Stop letting torchrun spawn. Launch each rank yourself in a background loop, each as its own python -m scorep process with its own SCOREP_EXPERIMENT_DIRECTORY=scorep_rank_N, using a single-node rendezvous (MASTER_ADDR=localhost, per-rank RANK/LOCAL_RANK). This is exactly what torchrun does internally - fork N ranks, hand each its identity - except now every rank runs under Score-P. No srun, no spawn.
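My launcher was a shell loop, but the same idea can be sketched in Python. Treat this as an illustration under assumptions rather than the exact script: the function names are hypothetical, MASTER_PORT=29500 and WORLD_SIZE are values I assume the env:// rendezvous needs, and LOCAL_RANK equals RANK only because everything runs on one node.

```python
import os
import subprocess
import sys


def rank_env(rank, world_size, base_env, master_port="29500"):
    """Environment for one rank: its identity plus its own Score-P output directory."""
    env = dict(base_env)
    env.update({
        "MASTER_ADDR": "localhost",      # single-node rendezvous
        "MASTER_PORT": master_port,      # assumption: any free port works
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(rank),         # single node: local rank == global rank
        "SCOREP_EXPERIMENT_DIRECTORY": f"scorep_rank_{rank}",
    })
    return env


def launch_ranks(world_size, script, script_args=(), scorep_args=()):
    """Start every rank as its own `python -m scorep` process and wait for all of them."""
    procs = []
    for rank in range(world_size):
        cmd = [sys.executable, "-m", "scorep", *scorep_args, script, *script_args]
        procs.append(subprocess.Popen(cmd, env=rank_env(rank, world_size, os.environ)))
    return [p.wait() for p in procs]  # one exit code per rank
```

Calling launch_ranks(8, "train.py") would start eight instrumented processes and return their exit codes; because each rank writes to its own scorep_rank_N directory, the per-rank traces never collide.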

Error 9: `SCOREP_CUDA_ENABLE` captured zero kernels

Symptom. With per-rank launch working, the traces contained the training process - but zero CUDA kernels. scorep-score showed 98% USR (Python) regions and no CUDA type at all, despite SCOREP_CUDA_ENABLE=kernel,memcpy,sync.

Cause. The Score-P Python wrapper passes unknown flags to scorep-config, whose help is explicit: --cuda|--nocuda … On default cuda instrumentation is disabled. Setting SCOREP_CUDA_ENABLE only configures what the CUDA adapter records - but without --cuda the adapter is never loaded.

Fix. Add --cuda to the launch: python -m scorep --cuda --thread=pthread. A wrinkle: do not add the documented -- script separator - this wrapper version forwards it to scorep-config, which rejects it (Unknown option: '--'). After the fix, a CUDA type appears in scorep-score, with ~106 named GPU kernel regions per rank - and, crucially, the NCCL collectives:

$ scorep-score profile.cubex
flt  type  max_buf[B]     visits  time[s] time[%]  region
     ALL  73,352,737  3,001,818   41.63   100.0   ALL
     USR  73,351,928  2,821,228   36.83    88.5   USR
    CUDA   2,347,618     90,294    3.99     9.6   CUDA   ← GPU kernels now captured

$ scorep-score -r profile.cubex | grep CUDA | sort -k4 -rn | head
  CUDA    700 visits  2.38s  ncclKernel_AllReduce_RING_LL_Sum_float   ← gradient sync
  CUDA     62 visits  0.09s  ncclKernel_AllGather_RING_LL_Sum_int8_t
  CUDA  9,538 visits  0.09s  at::native::unrolled_elementwise_kernel<...>
  CUDA  8,376 visits  0.08s  at::native::elementwise_kernel<128, 4, ...>
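Checking for the CUDA type is easy to automate across ranks. The parser below is a rough sketch that assumes the summary layout shown in this post, where the region type is the first all-uppercase word on each row; the function names are hypothetical.

```python
def region_types(scorep_score_output):
    """Collect the region types (ALL, USR, CUDA, ...) from scorep-score summary text."""
    types = set()
    for line in scorep_score_output.splitlines():
        for token in line.split():
            # The type is the first purely alphabetic, uppercase token on a row.
            if token.isalpha() and token.isupper():
                types.add(token)
                break
    return types


def cuda_was_captured(scorep_score_output):
    """True if the summary contains a CUDA row, i.e. the adapter was loaded."""
    return "CUDA" in region_types(scorep_score_output)
```

Feeding it the captured stdout of scorep-score for each scorep_rank_N directory gives a quick pass/fail before opening anything in Vampir.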

Error 10: CUPTI buffer overflow at 8 ranks

Symptom. The 1- and 4-GPU traces were clean, but the 8-GPU run dropped records:

[CUPTI Activity] Dropped 85222 records. Current buffer size: 1048576 bytes
Proposed minimum SCOREP_CUDA_BUFFER=8889000

Cause. Eight ranks each profiling through CUPTI overran the default 1 MB per-process CUDA activity buffer between flushes, silently discarding kernel records.

Fix. The warning itself proposes the answer: a minimum SCOREP_CUDA_BUFFER of 8889000 bytes (about 8.5 MiB). I set SCOREP_CUDA_BUFFER=64M for generous headroom, and no more records were dropped.

Error 11: the NCCL watchdog vs. Score-P shutdown race

Symptom. The 8-GPU run finished training but then 6 of 8 ranks aborted during teardown:

terminate called after throwing an instance of 'c10::Error'
  what():  Should never been called   (dummyHasPrimaryContext)
  ... c10d::ProcessGroupNCCL::ncclCommWatchdog()

The aborts struck after training, killing the process before Score-P flushed its profile - so those ranks left no trace on disk.

Cause. A shutdown race: PyTorch's background NCCL watchdog thread runs its cleanup destructor (which touches the CUDA device) at interpreter exit, at the same time Score-P tears down its CUDA context. Whichever loses, crashes. It's non-deterministic - a later 4-GPU run lost the race where an 8-GPU run had won it.

Fix. Remove the race instead of fighting it: call torch.distributed.destroy_process_group() at the end of train(), so NCCL is torn down cleanly, before the interpreter (and Score-P) begin shutdown. With the process group gone, there's no watchdog destructor left to collide with Score-P's teardown, and all eight ranks flush reliably.
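A minimal sketch of that teardown, assuming a train() wrapper of roughly this shape; run_training is a hypothetical stand-in for the real training call, and the helper name is my own.

```python
import torch.distributed as dist


def shutdown_process_group():
    """Destroy the default process group if one exists; return True if it did."""
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False


def train(run_training):
    """Run training, then tear NCCL down before interpreter shutdown begins."""
    try:
        run_training()
    finally:
        # Runs on success and on failure, so the process group is gone
        # before the interpreter (and Score-P) start their own shutdown.
        shutdown_process_group()
```

The is_initialized() guard keeps the same code safe on the 1-GPU configuration if no process group was ever created.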

Final state: per-rank GPU traces collected

After all eleven fixes, every DDP rank ran under its own Score-P measurement, capturing each worker's GPU kernels and NCCL communication.

Config	Runtime	Samples/sec	Speedup
1 GPU	478.8 s	75.0	1×
4 GPU	103.8 s	345.9	4.61×
8 GPU	52.8 s	680.7	9.08×
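
The Speedup column follows directly from the throughput column. The sketch below recomputes it and adds a parallel-efficiency figure (speedup divided by GPU count), which is my own derived number and not a column in the table.

```python
# GPUs -> samples/sec, copied from the table above
throughput = {1: 75.0, 4: 345.9, 8: 680.7}


def scaling(throughput, baseline_gpus=1):
    """Return {gpus: (speedup, efficiency)} relative to the baseline run."""
    base = throughput[baseline_gpus]
    result = {}
    for gpus, rate in sorted(throughput.items()):
        speedup = rate / base
        result[gpus] = (speedup, speedup / gpus)
    return result


for gpus, (speedup, efficiency) in scaling(throughput).items():
    print(f"{gpus} GPU: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

An efficiency above 1.0 means the run scaled better than linearly against the 1-GPU baseline.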

On 1 GPU there is zero NCCL; from 4 GPUs the gradient AllReduce appears, and by 8 GPUs ncclKernel_AllReduce is the single largest GPU activity (~2.375 s, comparable to the entire backward pass) - yet it overlaps backward compute on a separate CUDA stream, which is why throughput still scales near-linearly. The Master Timeline makes the overlap visible: compute runs on the default stream CUDA[0:7] (CUDA_NULL_STREAM) while ncclKernel_AllReduce runs concurrently on a separate stream CUDA[0:20].

What that decomposition means is the subject of the companion post.

Lessons

Score-P's re-exec is not optional - design around it. Any CUDA init that must happen before Score-P's C adapter loads has to happen before the python -m scorep call in your shell script.
The driver and runtime APIs are different things. nvidia-smi passing is necessary but not sufficient. Test the runtime directly (torch.zeros(1).cuda()), and use try/except, not is_available().
Module-level imports of the Score-P user API break DDP workers. Each rank is a fresh subprocess; lazy-import inside the first method PyTorch guarantees runs after CUDA setup.
A broken node will burn your budget on CUDA timeouts. --exclude it as soon as you spot it.
Separate the diagnostic trace from the full run. CUDA-enabled --max_steps 50 with --evaluation_strategy no gives clean, size-controlled traces; the full run with CUDA off gives robust scaling stats. You need both.
To trace DDP workers, launch them yourself - don't let torchrun spawn. Replace it with a background loop where each rank is its own python -m scorep process. This is the single most important structural fix.
Setting an env var is not the same as loading the adapter. SCOREP_CUDA_ENABLE configures the CUDA adapter; --cuda loads it. Confirm a CUDA type appears in scorep-score.
Profile sums overstate communication - use the timeline for wall-clock truth. NCCL LL kernels busy-wait, and DDP overlaps AllReduce with backward compute on a separate stream, so summed kernel-time double-counts the overlap.
Tear down NCCL cleanly so it doesn't race your profiler at exit. torch.distributed.destroy_process_group() at the end of training removes the watchdog before interpreter shutdown.

The OTF2 traces behind this post were generated on a SLURM cluster (A100-SXM4-40GB) as part of a Score-P performance analysis of DDP training scaling for the DNABERT-2 genomic classifier.

Classifier-free guidance above 7.5 oversaturated our product renders

Elise Moreau — Fri, 26 Jun 2026 05:36:29 +0000

TL;DR: Classifier-free guidance above a scale of ~7.5 pushed our SDXL product renders into oversaturation and clipped highlights. Adding CFG rescale at 0.7 plus dynamic thresholding fixed it with no retraining.

Around 18% of our automated product renders at Photoroom came back with blown-out highlights and oversaturated color once we raised the classifier-free guidance scale from 5.0 to 9.0 on our fine-tuned SDXL pipeline. The higher scale gave us sharper adherence to the prompt, which the catalog team wanted, but white backgrounds shifted toward grey-blue and metallic surfaces lost their specular detail. To be precise, the problem was not the prompt and not the fine-tune. It was the guidance arithmetic itself interacting with the noise schedule, and it is well documented if you know where to look.

What classifier-free guidance actually does

Classifier-free guidance combines two model predictions at each denoising step: one conditioned on the prompt and one unconditioned. The sampler extrapolates along the vector between them, scaled by a guidance weight. A weight of 1.0 means no guidance, and weights of 5 to 9 are typical for SDXL. Higher weights increase prompt adherence at the cost of pushing latents outside the distribution the model was trained on.

The method comes from Ho and Salimans in Classifier-Free Diffusion Guidance. The formula at each step is straightforward: take the unconditional prediction, add the guidance scale times the difference between conditional and unconditional. The nuance here is that this extrapolation has no bound. As you raise the scale, the standard deviation of the guided prediction grows past the statistics the model learned, and that excess energy shows up in the decoded image as clipping.
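
A toy example makes the unbounded extrapolation concrete. The two four-element "predictions" below are made-up numbers, not model outputs; the sketch only shows that the spread of the guided result keeps growing as the scale rises.

```python
from statistics import pstdev


def guide(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up toy predictions, for illustration only
uncond = [0.0, 0.1, -0.1, 0.2]
cond = [0.5, -0.3, 0.4, -0.6]

for scale in (1.0, 5.0, 9.0):
    guided = guide(uncond, cond, scale)
    print(f"scale {scale}: std {pstdev(guided):.3f}")
```

At scale 1.0 the result is just the conditional prediction; every step above that pushes the values further along the conditional-minus-unconditional direction.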

Why high guidance scales oversaturate

The decoded pixel range is fixed, roughly [-1, 1] before the VAE maps it back to RGB. When guidance inflates the variance of the predicted noise, the resulting latents carry larger magnitudes than the VAE was trained to reconstruct cleanly. Bright regions saturate to pure white, and color channels drift because the per-channel means shift together. We measured this directly: at guidance 9.0 the per-image latent standard deviation was about 1.4x the standard deviation of the conditional prediction alone.

This is the same failure mode the Imagen team described in Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, where high guidance weights produced saturated, unnatural images. Their answer was dynamic thresholding. A second, complementary fix came later from Lin and colleagues in Common Diffusion Noise Schedules and Sample Steps are Flawed, which introduced guidance rescale to bring the guided prediction's variance back in line.

Two fixes that stack: CFG rescale and dynamic thresholding

CFG rescale corrects the standard deviation of the guided prediction toward the conditional prediction, then blends between the corrected and raw versions by a factor. We set that factor to 0.7 after a sweep. Here is the core of what we run inside the sampler loop:

def apply_cfg_rescale(noise_cond, noise_uncond, guidance_scale, guidance_rescale=0.7):
    # standard classifier-free guidance
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # rescale variance back toward the conditional prediction (Lin et al. 2023)
    std_cond = noise_cond.std(dim=[1, 2, 3], keepdim=True)
    std_cfg = noise_cfg.std(dim=[1, 2, 3], keepdim=True)
    noise_rescaled = noise_cfg * (std_cond / std_cfg)

    # blend corrected and raw so detail is not fully flattened
    return guidance_rescale * noise_rescaled + (1.0 - guidance_rescale) * noise_cfg

Dynamic thresholding works at a different layer. At each step it predicts the clean sample, computes a high percentile of the absolute pixel values (we use the 99.5th), and clamps to that value before renormalizing. The two corrections address different symptoms. Rescale fixes the variance inflation; thresholding clamps the residual outliers that survive. Running both at guidance 9.0 brought our oversaturation rate from 18% to under 2% on a held-out set of 4,000 SKUs.
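
For reference, here is a minimal sketch of the thresholding step, not our production code. It assumes a predicted clean sample of shape (N, C, H, W) nominally in [-1, 1], and it follows the Imagen description in acting only when the percentile exceeds 1; the function name is hypothetical.

```python
import torch


def dynamic_threshold(x0_pred, percentile=0.995):
    """Clamp a predicted clean sample to a per-image percentile, then renormalize."""
    batch = x0_pred.shape[0]
    flat = x0_pred.reshape(batch, -1).abs().float()
    # Per-image threshold s: the chosen percentile of absolute values.
    s = torch.quantile(flat, percentile, dim=1)
    # Assumption (Imagen's description): only act when s > 1, so predictions
    # that are already in range pass through unchanged.
    s = s.clamp(min=1.0).to(x0_pred.dtype).view(batch, 1, 1, 1)
    clipped = torch.maximum(torch.minimum(x0_pred, s), -s)
    return clipped / s
```

Because the threshold is computed per image, one blown-out sample in a batch does not change how the others are scaled.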

How we chose the rescale factor

We swept the rescale factor across 0.0, 0.3, 0.5, 0.7, and 1.0 and scored each batch on two axes. The first was a saturation metric: the fraction of pixels with channel values above 0.97 after decoding. The second was CLIP image-text similarity, so we did not trade away the prompt adherence we raised guidance to get. A factor of 1.0 fully matched the conditional variance but flattened contrast on glossy products. A factor of 0.0 left the original problem. The factor of 0.7 held CLIP similarity within 0.4% of the unrescaled run while cutting the saturated-pixel fraction by more than half.
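
The saturation metric is simple to reproduce. This sketch assumes decoded images in [0, 1] with shape (N, C, H, W) and counts a pixel as saturated when any of its channels exceeds the threshold; both of those choices are assumptions on my part, and the function name is hypothetical.

```python
import torch


def saturated_fraction(images, threshold=0.97):
    """Per-image fraction of pixels with at least one channel above the threshold.

    images: (N, C, H, W) tensor, assumed to be scaled to [0, 1].
    """
    saturated = (images > threshold).any(dim=1)  # (N, H, W) boolean mask
    return saturated.float().mean(dim=(1, 2))    # one fraction per image
```

Scoring per image rather than per batch makes it easy to count how many renders cross whatever rejection threshold the catalog team agrees on.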

Trade-offs and limitations

CFG rescale adds two standard deviation reductions and an elementwise blend per step. On our pipeline that is well under 1% of step latency, so cost is not the concern. The real trade-off is contrast. At rescale factors above 0.8 we saw glossy and metallic products lose specular punch, which matters for jewelry and electronics catalogs. Dynamic thresholding has its own edge case: on images that are genuinely meant to be bright and high-key, an aggressive percentile clamps legitimate highlights, so we tuned the percentile per product category rather than globally.

There is also a simpler path we rejected. You can lower the guidance scale back to 5.0 and avoid the whole question, but you lose the prompt fidelity the catalog team asked for. The corrections let us keep a scale of 8.0 to 9.0 without the artifacts, which was the actual goal.

Where to go next

If your renders saturate at high classifier-free guidance, measure the per-image latent standard deviation against the conditional-only prediction before reaching for retraining. The fix is almost always in the guidance arithmetic, not the weights. I would start with CFG rescale at 0.7, add dynamic thresholding only if outliers remain, and validate with a saturated-pixel metric alongside CLIP similarity so you do not silently trade away adherence.

ComfyUI 'Torch not compiled with CUDA enabled'? Every Fix That Works on Windows, Linux, and Mac (2026)

Jovan Chan — Wed, 24 Jun 2026 07:06:37 +0000

This article was originally published on runaihome.com

TL;DR: This error means the PyTorch you have installed is the CPU-only build — it literally has no CUDA code compiled in, so it can't see your GPU even though the driver is fine. The fix is never to reinstall CUDA or your GPU driver; it's to uninstall the CPU torch and reinstall the matching cu12x wheel from PyTorch's own index. On an RTX 50-series card you need the cu128 build specifically.

What you'll be able to do after this guide:

Confirm in 10 seconds whether your torch is the CPU build or the GPU build
Reinstall the correct CUDA wheel in both ComfyUI portable and a manual venv install
Pick the right cu124 / cu126 / cu128 wheel for your exact GPU (and know why RTX 50-series is different)

Honest take: 90% of the time this happens because a custom node ran pip install something, pip pulled torch as a dependency, and on Windows the default PyPI torch wheel is CPU-only. You didn't break CUDA — pip quietly swapped your good GPU build for a smaller CPU one. Reinstalling the right wheel takes about three minutes.

What the error actually means

When ComfyUI starts (or the first time it tries to move a model to the GPU) you get a traceback ending in:

File "...\torch\cuda\__init__.py", line 310, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Read it literally: the PyTorch binary you installed was built without CUDA support. PyTorch ships in separate flavors — a CPU-only wheel and several CUDA wheels (cu124, cu126, cu128, etc.). The CPU wheel is a completely different binary with no GPU kernels in it. No driver update, no CUDA Toolkit install, and no environment variable will add CUDA to a CPU wheel. You have to replace the wheel.

This is different from a driver problem. If your NVIDIA driver were missing, nvidia-smi would fail. Run it in a terminal:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 581.xx       Driver Version: 581.xx       CUDA Version: 12.8     |
|   0  NVIDIA GeForce RTX 4070 ...                                            |
+-----------------------------------------------------------------------------+

If nvidia-smi shows your card, your driver is fine and the problem is 100% on the PyTorch side. (The "CUDA Version: 12.8" line here is the maximum CUDA the driver supports, not the version PyTorch needs — a common point of confusion.)

Step 1: Confirm you actually have the CPU build

Before changing anything, prove the diagnosis. ComfyUI portable ships its own Python under python_embeded, so use that exact interpreter — not whatever python resolves to in your PATH. From the ComfyUI_windows_portable folder:

.\python_embeded\python.exe -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

A CPU build prints something like this — note the +cpu suffix and False:

2.8.0+cpu
False

A working GPU build prints a CUDA tag (+cu128) and True:

2.8.0+cu128
True

If you see +cpu or False, this guide fixes you. If you see +cu128 and True but ComfyUI still throws the error, you have two Python environments and ComfyUI is launching the wrong one — skip to the "Two-environments trap" section below.
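
If you want to script that check, the build flavor is just the part of the version string after the plus sign. A small sketch with hypothetical helper names; it works on the string alone, so it runs even where torch itself is broken.

```python
def torch_build_flavor(version):
    """Return the local build tag: 'cu128', 'cpu', or None if there is no tag."""
    _, sep, local = version.partition("+")
    return local if sep else None


def is_cuda_build(version):
    """True only when the version string carries a CUDA tag such as +cu128."""
    flavor = torch_build_flavor(version)
    return flavor is not None and flavor.startswith("cu")


for version in ("2.8.0+cpu", "2.8.0+cu128"):
    print(version, torch_build_flavor(version), is_cuda_build(version))
```

Pass it the first line printed by the Step 1 one-liner. A version string with no tag at all is reported as None rather than guessed at; in that case torch.version.cuda, which is None on a CPU-only build, settles the question.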

For a manual (cloned-repo) install, run the same one-liner but activate your venv first, or call the venv's Python directly:

# Windows venv
.\venv\Scripts\python.exe -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Linux/Mac venv
./venv/bin/python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

Step 2: Pick the right CUDA wheel for your GPU

This is the part people get wrong. The wheel tag (cu124, cu126, cu128) is the CUDA runtime bundled inside the PyTorch wheel. It does not need to match a CUDA Toolkit on your machine — the wheel is self-contained. What it does need to match is your GPU architecture.

Your GPU	Architecture	Wheel to install	Minimum PyTorch
RTX 50-series (5060 Ti / 5070 / 5080 / 5090)	Blackwell, `sm_120`	`cu128`	2.7.0
RTX 40-series (4060 Ti / 4070 / 4080 / 4090)	Ada, `sm_89`	`cu124`, `cu126`, or `cu128`	any current
RTX 30-series (3060 / 3080 / 3090)	Ampere, `sm_86`	`cu124`, `cu126`, or `cu128`	any current

The RTX 50-series is the trap. Blackwell's sm_120 compute capability was only added to stable PyTorch in 2.7.0, which shipped the first pre-built CUDA 12.8 wheels with native Blackwell support. If you install an older cu124 wheel on an RTX 5090, you'll get past this error only to hit CUDA error: no kernel image is available for execution on the device — the sibling problem of running a too-old wheel on a too-new GPU. On a 50-series card, use cu128 and PyTorch 2.7.0 or newer, full stop.

For RTX 30/40-series, any of cu124/cu126/cu128 works; cu128 is the safe current default since it's what ComfyUI's own portable builds ship now.
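
The table can be turned into a small lookup that also builds the install command. This sketch covers only the three architectures listed above and deliberately raises for anything else rather than guessing; the names are hypothetical.

```python
INDEX_BASE = "https://download.pytorch.org/whl/"

# (major, minor) compute capability -> acceptable wheel tags, from the table above
WHEELS_BY_CAPABILITY = {
    (12, 0): ["cu128"],                    # RTX 50-series, Blackwell sm_120
    (8, 9): ["cu124", "cu126", "cu128"],   # RTX 40-series, Ada sm_89
    (8, 6): ["cu124", "cu126", "cu128"],   # RTX 30-series, Ampere sm_86
}


def wheel_tags_for(compute_capability):
    """Return the wheel tags listed for this GPU; KeyError if it is not in the table."""
    return WHEELS_BY_CAPABILITY[tuple(compute_capability)]


def pip_install_command(tag):
    """Build the install command for one wheel tag, using PyTorch's own index."""
    return f"pip install torch torchvision torchaudio --index-url {INDEX_BASE}{tag}"


# cu128 is the last entry for every listed GPU, so it is a safe default.
print(pip_install_command(wheel_tags_for((8, 9))[-1]))
```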

Step 3: Reinstall — ComfyUI portable (Windows)

From inside the ComfyUI_windows_portable directory, uninstall the bad trio first so pip doesn't try to "keep" the CPU build:

.\python_embeded\python.exe -m pip uninstall -y torch torchvision torchaudio

Then install the CUDA wheel from PyTorch's index. For an RTX 50-series card:

.\python_embeded\python.exe -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Note --index-url, not --extra-index-url. Using --index-url forces pip to pull only from the PyTorch index, which guarantees you get the GPU wheel instead of pip silently falling back to the CPU-only one on PyPI. That fallback is the exact mechanism that broke you in the first place.

Re-run the check from Step 1. You want +cu128 and True. Then launch ComfyUI and the error is gone.

If the download is slow or stalls — the CUDA wheels are large, often 2.5 GB-plus because they bundle the CUDA runtime, cuDNN, and NCCL — let it finish; that size is normal and is the whole reason PyPI defaults to the small CPU wheel on Windows in the first place.

Step 4: Reinstall — manual / venv install (Windows, Linux)

If you cloned the repo and run inside a venv, activate it, then do the same uninstall/reinstall:

# activate first
source venv/bin/activate          # Linux/Mac
.\venv\Scripts\activate           # Windows

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

ComfyUI also publishes a maintained requirements path; if you'd rather follow the project's pinned versions, the official install docs list the current recommended cu128 command for your platform. Either way the principle is identical: uninstall CPU torch, install the cu128 wheel from the PyTorch index.

Why this keeps happening (and how to stop it)

On Windows and macOS, the torch package on the default Python Package Index (PyPI) is the CPU-only wheel. PyPI serves the lightweight CPU binary by default to those platforms; the CUDA-enabled wheels live only on PyTorch's own download index. So the moment anything runs a plain pip install torch — or installs a package that lists torch as a dependency without pinning the CUDA build — pip happily grabs the CPU wheel from PyPI and overwrites your working GPU install.

The usual culprit is a custom node. You install some shiny new node, its requirements.txt says torch>=2.x, ComfyUI's "install dependencies" step runs, pip decides your current torch doesn't satisfy something, and it reinstalls from PyPI — CPU build. ComfyUI was fine yesterday and broken today, and you "didn't change anything." You did: a node did.

Two habits prevent the relapse:

When installing custom-node requirements, never let pip touch torch. If a node's requirements pull torch, install the node

Using the channels-last memory format reduced the latency of our convolutional backbone by 22%

Elise Moreau — Wed, 24 Jun 2026 05:36:21 +0000

TL;DR: Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cut inference latency by about 22% on A100s, with no accuracy change and a four-line code edit.

Our background-removal model at Photoroom spent roughly 31 ms per 1024x1024 image on an A100, and profiling pointed most of that time at cuDNN convolution kernels rather than our diffusion sampler. The model is a fairly standard U-Net style encoder-decoder, all convolutions, running in float16 under torch.autocast. Before touching the architecture, I wanted to rule out the cheap wins, and the cheapest one turned out to be tensor memory layout. The channels-last memory format gave us most of the speedup we were chasing, and the change fit in a handful of lines. To be precise, the network math is identical; only the byte order of the activations changes.

What channels-last memory format changes

The channels-last memory format stores a 4D activation tensor in NHWC byte order, keeping the channel values for one spatial position contiguous in memory. PyTorch keeps the logical NCHW shape, so your indexing and your model code stay the same. What changes is the stride pattern, which lets cuDNN select kernels that read contiguous channels and run more efficiently on tensor-core hardware.

The default PyTorch layout is NCHW (channels-first), where all of one channel's pixels sit together. NVIDIA's tensor cores prefer the NHWC arrangement for convolutions, as documented in their convolution performance guide. When your tensors arrive in NCHW, cuDNN often inserts transpose passes around each convolution to reshuffle data, and those transposes are pure overhead. Converting once at the input and keeping the format consistent removes that per-layer reshuffling.
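
A tiny tensor shows what "same shape, different strides" means in practice. The sizes here are arbitrary and chosen only to keep the stride arithmetic readable; it runs on CPU.

```python
import torch

x = torch.randn(2, 3, 4, 5)                     # logical shape N=2, C=3, H=4, W=5
x_cl = x.to(memory_format=torch.channels_last)  # same values, NHWC byte order

print(x.shape, x.stride())        # torch.Size([2, 3, 4, 5]) (60, 20, 5, 1)
print(x_cl.shape, x_cl.stride())  # torch.Size([2, 3, 4, 5]) (60, 1, 15, 3)

# The channel stride is now 1: the three channel values of one pixel are adjacent.
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
print(torch.equal(x, x_cl))                                   # True
```

Indexing such as x_cl[0, 1, 2, 3] still addresses batch 0, channel 1, row 2, column 3, which is why model code does not need to change.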

Converting a PyTorch model to channels-last

The conversion API has been stable since well before PyTorch 2.3, and the official memory format tutorial covers the details. Two things need the format: the module parameters and the input tensor. If only one of them is channels-last, cuDNN falls back to NCHW kernels and you gain nothing.

import torch

# convert the model's conv weights once, at load time
model = model.to(memory_format=torch.channels_last)

# convert each input batch to match
x = x.to(memory_format=torch.channels_last)

with torch.autocast("cuda", dtype=torch.float16):
    y = model(x)  # output is channels_last; convert back if a
                  # downstream op needs contiguous NCHW

One subtlety worth checking: x.to(memory_format=torch.channels_last) requires a 4D tensor and raises a RuntimeError on a 3D one, so make sure your inputs carry an explicit batch dimension. After the forward pass, the output keeps channels-last strides. If you feed it into an operation that assumes contiguous NCHW, call .contiguous() there rather than reverting the whole pipeline.

Why NHWC is faster on tensor cores

Tensor cores execute matrix-multiply-accumulate on small tiles, and convolutions get lowered to those tile operations. With NHWC layout the channel dimension, which is the contracting dimension of the implicit matmul, is contiguous, so the kernel loads aligned vectors without gathering strided data. The effect grows with channel count. Our deepest encoder blocks at 512 channels saw the largest per-layer improvement, while the early high-resolution layers at 64 channels barely moved.

The gain also depends on precision. Channels-last pairs best with float16 or bfloat16, because that is where tensor cores do the most work; in pure float32 with TF32 disabled, the kernels often route through CUDA cores where the layout advantage shrinks. We were already running float16 under autocast, so the two optimizations stacked. The nuance here is that channels-last is not a free win in every configuration. It is a win when your convolutions are wide, your precision is reduced, and your hardware has tensor cores.

Measuring the speedup without fooling yourself

A layout change is easy to misattribute, so I measured carefully. I ran 200 warmup iterations, then timed 1000 forward passes with torch.cuda.synchronize() around each measurement window, since CUDA calls are asynchronous and an unsynchronized timer reports queue time rather than kernel time. I also confirmed the output tensors matched the NCHW baseline within float16 tolerance, so I knew I was timing the same computation.
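The measurement pattern described above can be sketched as follows. This is a minimal illustration, not our actual benchmark script: the helper name time_forward, the stand-in Conv2d layer, and the small iteration counts in the usage lines are all assumptions, and the code falls back to CPU timing when no GPU is present.

```python
import time
import torch

def time_forward(model, x, warmup=200, iters=1000):
    """Mean seconds per forward pass, synchronizing around the timed window."""
    use_cuda = x.is_cuda
    with torch.inference_mode():
        for _ in range(warmup):          # let cuDNN pick kernels and caches warm up
            model(x)
        if use_cuda:
            torch.cuda.synchronize()     # drain queued work before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()     # wait for the last kernel before stopping it
        elapsed = time.perf_counter() - start
    return elapsed / iters

# Stand-in model and input (hypothetical, small enough to run anywhere)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(8, 8, kernel_size=3, padding=1).to(device).eval()
x = torch.randn(1, 8, 32, 32, device=device)
per_iter = time_forward(model, x, warmup=5, iters=20)
print(f"{per_iter * 1e3:.3f} ms per forward pass")
```

Running the same helper on the NCHW and channels-last variants of a model, with identical inputs, gives the two numbers to compare.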

The headline number was a drop from roughly 31 ms to 24 ms per image, about 22% on our A100. On a V100 the same change gave closer to 14%, which tracks with its older tensor-core generation. I would treat any single-number claim with suspicion until you reproduce it on your own shapes; the benefit is real but hardware-dependent and model-dependent.

Trade-offs and limitations

The format is not universally beneficial. Networks dominated by pointwise operations, normalization, or attention rather than spatial convolutions show little or no improvement, because those ops do not hit the cuDNN convolution path that NHWC accelerates. Transformer backbones, for instance, rarely care.

There is also a correctness trap. Mixing layouts inside a model can silently insert transposes that erase the gain, and some custom operators or older third-party layers assume contiguous NCHW and will either copy or error. If you run torch.compile, verify the format survives the traced graph rather than assuming it does. For very small channel counts the conversion overhead can outweigh the kernel savings, so profile before committing it everywhere.

Wrapping up

The channels-last memory format is one of the few optimizations that costs almost nothing to try and is straightforward to revert if it does not help. For a convolution-heavy vision model running in float16 on tensor-core GPUs, it is worth measuring before you reach for quantization or architectural surgery. What I would try next is combining it with torch.compile and a CUDA graph capture, then re-profiling to see how much transpose overhead is actually left in the trace.

Data Science Workload: RAM Limits on the Dell Pro Max 14 MC14250

Review Laptop — Tue, 23 Jun 2026 07:58:01 +0000

In data science, managing system resources is a constant balancing act. When you run Jupyter Notebook, process data with pandas, and train a model with PyTorch all at the same time, the line between "smooth" and "Out of Memory (OOM)" becomes very thin.

To test this in practice, I ran experiments on a Dell Pro Max 14 MC14250 configured with a Core Ultra 7 255H and 16GB of LPCAMM2 LPDDR5x RAM. The goal was to find the memory ceiling under heavy workloads.

Handling large datasets with pandas in practice

When you load a CSV file of about 2-3GB, pandas often uses 3-5 times the size of the original file because of how it stores data types in memory. With 16GB of RAM, if you do not optimize by using chunksize or by downcasting dtypes, the system quickly hits its limit.

import pandas as pd

# Example: load a large file and check its memory footprint
df = pd.read_csv('large_dataset.csv')
# With a 3GB source file, RAM usage can jump past 10GB right away
df.info(memory_usage='deep')  # info() prints its report itself and returns None
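The chunksize and downcast ideas mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the CSV is a small synthetic stand-in built in memory, the column names are hypothetical, and downcast is a helper name introduced here, not a pandas function.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a large CSV (hypothetical columns)
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "user_id": rng.integers(0, 1000, size=10_000),
    "score": rng.random(10_000),
})
csv_text = raw.to_csv(index=False)

def downcast(df):
    """Shrink numeric columns to the smallest dtype that still holds their values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Baseline: default dtypes (int64 / float64)
baseline = pd.read_csv(io.StringIO(csv_text))

# Read in chunks and downcast each chunk before keeping it
chunks = [downcast(c) for c in pd.read_csv(io.StringIO(csv_text), chunksize=2_000)]
compact = pd.concat(chunks, ignore_index=True)

before = int(baseline.memory_usage(deep=True).sum())
after = int(compact.memory_usage(deep=True).sum())
print(f"{before} bytes -> {after} bytes")
```

Downcasting floats to float32 loses precision, so it only fits columns where that is acceptable. Concatenating every chunk still holds the full dataset in memory; for data that does not fit even after downcasting, aggregate per chunk instead of concatenating.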

PyTorch batch size and swap behavior on the iGPU

When you move on to training a model with PyTorch on the integrated GPU (Intel Arc Pro 140T), GPU memory is shared with system RAM.

Batch size: With mid-sized models, too large a batch size quickly triggers RuntimeError: CUDA out of memory (or the equivalent error on Intel GPUs).
Swap behavior: Once usage passes 16GB, Windows starts using the pagefile (swap). At that point processing speed drops drastically, because SSD access is far slower than LPDDR5x RAM.

Practical conclusion: With the standard configuration of the Dell Pro Max 14 MC14250, you can comfortably handle datasets under 1GB. For larger data, however, upgrading to the maximum of 64GB of RAM (possible thanks to LPCAMM2 support) is a must to avoid bottlenecks when running professional data science workflows.

This article is a technical summary. See the full review in the original post.

The SDXL VAE overflow that decoded black images in fp16

Elise Moreau — Tue, 23 Jun 2026 05:37:00 +0000

TL;DR: The SDXL VAE decoder pushes activations past 65504, the max value fp16 can hold, so the last decode step overflows to inf and you get a fully black image. At Photoroom we hit this on roughly 1 in 600 product renders before we caught it. The fix is to upcast only the VAE, or swap in rescaled decoder weights, not to drop the whole pipeline to fp32.

We run SDXL-based pipelines for product photography. A customer uploads a sneaker on a kitchen table, we cut it out, then generate a clean studio background around it. Hundreds of thousands of renders a day, mostly on A10G and A100 GPUs, with the UNet in fp16 to keep the per-image latency under our budget.

The bug showed up as a thin stream of complaints. Black image. No error, no stack trace, no NaN warning in the logs. Just a 1024x1024 PNG of pure black where a render should be.

What was actually happening

I pulled 40 of the failing seeds and replayed them with hooks on every module in the VAE decoder. The UNet output was fine. Latents looked normal, values in the usual range. The decode was where it died.

To be precise, the overflow lives in the decoder's mid and up blocks. SDXL's VAE has a few residual layers where the post-convolution activations spike hard for certain inputs. fp16 tops out at 65504. I logged a max activation of 3.1e5 inside one of the up_blocks resblocks on a failing seed. Once a single value hits inf, the following GroupNorm turns it into NaN across the whole normalization group, because the group statistics themselves become inf, and you decode garbage that clamps to black.

The nuance here is that it's input-dependent. Most latents never come close to the ceiling. High-contrast scenes with bright speculars, like a glossy bottle on white, are the ones that tip over. That's why our QA never saw it and production did.

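import torch

# hook to catch the overflow as it happens
def watch(name):
    def hook(_module, _inputs, out):
        if not torch.is_tensor(out):
            return  # skip modules that return tuples or other containers
        m = out.detach().abs().max().item()
        if m > 6e4:  # fp16 max is 65504; an overflowed value reports as inf
            print(f"{name}: max activation {m:.1f}")
    return hook

for n, mod in pipe.vae.decoder.named_modules():
    mod.register_forward_hook(watch(n))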

That printout is what pointed me at the exact resblock instead of guessing.

The options we weighed

There's no single right answer here, and the trade-off is VRAM and latency against correctness. We measured four approaches on the same 500-seed batch.

Approach	Fixes overflow	VAE decode latency	Extra VRAM	Notes
Full pipeline fp32	Yes	+210%	~2x	Kills our latency budget
`force_upcast` VAE to fp32	Yes	+18%	+1.1 GB	Only the VAE runs fp32
VAE in bf16	Yes	+6%	+0.1 GB	Needs Ampere or newer
fp16-fix decoder weights	Yes	+0%	+0 GB	Rescaled weights, fp16 stays

Full fp32 was off the table. It doubled memory and blew past the latency we promise. The other three all hold up.

force_upcast is the diffusers default for a reason. It keeps the UNet in fp16 and runs only the VAE in fp32. One flag, and the overflow is gone because fp32 has the headroom.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.vae.config.force_upcast = True  # VAE runs fp32, UNet stays fp16

We landed on bf16 for the VAE on our Ampere fleet. bf16 has the same exponent range as fp32, so the 3.1e5 activation fits without issue, and the decode cost was 6% instead of 18%. On the older A10G boxes that don't get us the bf16 path we wanted, we use the rescaled fp16-fix decoder weights, which shift the activation magnitudes down so they never reach the ceiling in the first place.
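The range argument is easy to check directly. The sketch below casts the 3.1e5 magnitude from the failing seed into both half-precision formats; it is a standalone illustration, not part of our pipeline code.

```python
import torch

fp16_max = torch.finfo(torch.float16).max          # largest finite fp16 value
spike = torch.tensor(3.1e5, dtype=torch.float32)   # magnitude logged on a failing seed

as_fp16 = spike.to(torch.float16)    # beyond the fp16 range
as_bf16 = spike.to(torch.bfloat16)   # bf16 shares fp32's exponent width

print(fp16_max, as_fp16.item(), as_bf16.item())

fp16_overflows = bool(torch.isinf(as_fp16))
bf16_is_finite = bool(torch.isfinite(as_bf16))
bf16_rel_error = abs(as_bf16.item() - 3.1e5) / 3.1e5
```

The bf16 value is finite but rounded, which is the precision cost that comes with the extra range.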

One detail that bit us: if you call pipe.enable_vae_tiling() for large outputs, the tiling runs before the dtype upcast, so you still need the dtype right. Tiling reduces peak memory, it does not touch the numerical range.

Where the gateway fits

A side note, since people ask how the text side of this connects. Before the diffusion step, we rewrite the user's scene description into a cleaner prompt with an LLM, and we generate alt-text captions after. Those LLM calls go through Bifrost, an open-source gateway that gives us one OpenAI-compatible endpoint with automatic failover across providers. It has nothing to do with the VAE overflow. It just means when one provider has a bad afternoon, the caption step doesn't take the render pipeline down with it.

Trade-offs and limitations

bf16 is not a free win. It has the range of fp32 but only 7 explicit mantissa bits, fewer than fp16's 10, so you trade overflow safety for a little precision. On our renders the visible difference was nothing, but I would not assume that for every model. Measure SSIM against an fp32 reference before you ship.

The fp16-fix weights are a community rescaling, not an official release. They work well, and we validated them on 2000 renders, but you're trusting a third-party checkpoint. Pin the exact revision.

And none of this helps if your latents themselves are out of distribution. We saw two black images that were not VAE overflow at all, they were a bad LoRA producing extreme latents. The hook above tells you which failure you're looking at, so put it in your eval harness, not only in debugging.

Intro to Computer Vision Code-Along series - S1E0

Levente Slajcho — Mon, 22 Jun 2026 11:34:38 +0000

Motivation

Let me start with a very short story.

I did my first project involving Computer Vision when I was 15 years old, completely fascinated by technology and by creative solutions to all kinds of problems.

At the time, I thought it would be cool to turn my PC into a touchscreen device, so I took the naked LCD panel and diffuser layer from an old screen and built them into a cardboard box. I also disassembled my webcam and replaced its RGB filter with a makeshift infrared filter made from the black disk of an old floppy disk. The sketchy IR camera, together with a few IR LEDs, was placed inside the box, and whenever I touched the diffuser, it reflected the IR light back to the camera sensor.

I had something like this in mind. Ended up with an 8 inch screen of an old multimedia station for cars | Image source: https://prototypinginterfaces.com/5-5/

Using CCV from the since-vanished NUI Group (https://github.com/nuigroup/ccv2), I calibrated the four corners of the screen, and together with the TUIO mouse driver, that was enough to track my fingers and use them as multi-touch input.

I can't really describe what it feels like as a teenager to build a touchscreen PC for exactly $0. That small project opened a huge window for me. It showed me that cameras are not only for recording fun and memorable moments - they can also be used to build things, solve problems, and interact with the world in completely different ways.

Fast forward to 2026

A little over a decade later, I graduated in Media Informatics and Visual Computing, and I now have almost 8 years of combined professional experience in 3D design, product development, and Java development. The first satisfied my love for DIY projects, the latter my love for IT.

In a way, Computer Vision as my ultimate career goal feels like the combination of those two worlds. Cameras and image processing have a very strong connection to the real world, especially if you consider Computer Vision as part of robotics - and that is exactly the field I am absolutely in love with.

However, having experience only from my studies is a turn-off for companies looking to hire a Computer Vision Engineer.

Nine years ago I managed to get a 3D designer job with Solidworks just by sitting down to practice all day and all night for only 2 weeks, turning my hobby and personal interest into a profession. Computer Vision is of course a more complex topic, but I am convinced that with the same amount of motivation and enthusiasm the same thing will happen again.

About the Computer Vision Code-Along series

Let's climb this mountain together, and follow me if you're interested.

If you are in a similar situation and looking forward to working in this field and helping the world with your own vision and your computer's vision, stick with me. In this series, I'll be working on three kinds of projects: Kaggle competitions, real-life problems, and totally made-up problems that nobody ever asked for a solution to - let's call those fun projects.

The focus of every project is to learn something new, gain experience, and overcome problems, whether they are skill issues or technical ones.

What to expect and what not to expect

This series is about modern Computer Vision using neural networks in the first season and vision transformers (ViT) in the second season. Some basic but solid knowledge of traditional Computer Vision methods is required to keep up.

It is not a shortcut to expertise in modern Computer Vision. Expect a rather slow pace, and don't expect to find the best possible solutions here. That is exactly the point of this series: you're learning with me, but more importantly for yourself. Think, code, debug, experiment, and let others know in the comments if you came up with a different solution.

Over the next few months - at roughly 1-2 episodes a week - we'll go through different Computer Vision techniques and work on projects related to them in a learning-by-doing manner. If your learning style is very theory-first, then this series might not be the perfect fit for you - although I still recommend following along, because we'll talk about theory as well.

You'll also get full transparency into my technical struggles. At first glance, some parts may feel redundant, but these insights are part of this journey. This is not a course, this is a series of blog posts aimed at exploring, learning, trying different paths, and gaining experience in this field.

If you stay with me until the end, you'll hopefully become the proud owner of a beautiful GitHub repo and gain insight and experience in modern Computer Vision.

Where to start

Depending on your learning style and your starting point, there are different ways to begin, but most importantly, absolutely get familiar with OpenCV.

If you are completely new to Computer Vision, I strongly recommend building solid foundations in traditional Computer Vision first.
For complete beginners, I also made a small Jupyter Notebook as an appetizer that showcases OpenCV filters using nothing but your webcam. You can find it here:

slelo / CVCA-S1E0-Mini-OpenCV-Playground

Computer Vision Appetizer for complete beginners

About

This repository is part of the first episode of my newly started Computer Vision Code-Along blog post series.

OpenCV-Filters

Computer Vision Appetizer for beginners: Simple code with OpenCV filters. Feel free to explore, experiment, change parameters, and learn by doing.

Disclaimer: this repo will be updated from time to time

Requirements

Python 3.10 or higher
OpenCV
Jupyter Notebook

Installation

This one isn't gonna be too long, just run:

pip install -r requirements.txt

Example with Canny filter

If you're already familiar with OpenCV, I wholeheartedly recommend - and kind of require - completing the Deep Learning Specialization by Andrew Ng on Coursera (https://www.coursera.org/specializations/deep-learning). It gives you a lot of understanding of what is happening under the hood, and the assignments also make you implement many of those ideas yourself.

I'll be using PyCharm as a development environment and Python 3.10 and 3.11 by default for compatibility reasons. If we use other tools in later projects, I'll let you know.

Foreshadowing

In the next episode, we'll use U-Nets for image segmentation in an inactive Kaggle competition. Until then, you can read more about them here: https://towardsdatascience.com/understanding-u-net-61276b10f360/

Please make sure you have a basic understanding of Convolutional Neural Networks. To build better intuition, I also recommend reading about AlexNet, ResNet, and MobileNet, and learning how they work and why they became so popular. This video and the following ones in the playlist will help: https://www.youtube.com/watch?v=-bvTzZCEOdM&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=12

The next episode will be linked here when it's ready.

Thank you for reading, and your thoughts are more than welcome in the comments.

The seam our tiled upscaler left on every 4K product render

Elise Moreau — Fri, 19 Jun 2026 06:51:10 +0000

TL;DR: We tile high-res images through our upscaler because a full 4096×4096 pass blows past 24GB of VRAM. For months every render had a faint cross down the middle. The fix was not a bigger GPU. It was admitting that hard tile boundaries break any model with a receptive field, and feathering the overlap with a raised-cosine weight instead of averaging it.

At Photoroom I work on the generative side, mostly diffusion for product photography. One of our smaller models is a convolutional upscaler that takes a 1024px cutout and pushes it to print resolution. Nothing exotic. A residual-in-residual dense block network, the kind of thing that has been around since ESRGAN in 2018.

It worked fine in the notebook. In production, on large images, it left a seam.

What a seam actually is

You cannot run a 4096×4096 image through this model on a single 24GB card. So you tile. Cut the image into 512px squares, upscale each, stitch them back. The naive version of this is three lines of code and it is wrong.

The reason is the receptive field. To be precise, every output pixel near a tile edge was computed from a partial neighborhood. The convolutions on the right edge of the left tile never saw the pixels that lived in the right tile. So the two halves disagreed by a small amount, maybe 2-3 grey levels, and the human eye is very good at finding a straight vertical line of consistent 2-3 level error. On a flat grey studio background it was obvious. On busy texture it hid.

We measured it. Sampling 200 renders, the mean absolute difference across the stitch line was 4.1 on an 8-bit scale, versus 0.9 for an adjacent non-seam column. Small number, very visible artifact.

Overlap is necessary but not sufficient

The first fix everyone reaches for is overlapping tiles. Take 512px tiles but step by 448, so each pair shares a 64px strip. Then in the shared region you have two predictions and you blend them.

The nuance here is how you blend. If you average the overlap with a flat 0.5/0.5 weight, you have moved the discontinuity, not removed it. The blend region now has a soft step at each of its two edges where the weighting suddenly kicks in. Better than before. Still a seam, just blurrier.

What works is a weight that goes smoothly to zero at the tile border, so a pixel contributes nothing exactly where its receptive field ran out. A raised-cosine (Hann) window does this. Each tile is multiplied by its window, the windows are accumulated, and you divide by the summed weight.

import torch

def hann_2d(size: int, overlap: int) -> torch.Tensor:
    # ramp up over the overlap, flat in the middle, ramp down
    w = torch.ones(size)
    # skip the window's exact-zero endpoint: pixels on the outer image border
    # are covered by a single tile and would otherwise get zero total weight
    ramp = torch.hann_window(2 * overlap + 2, periodic=False)[1:overlap + 1]
    w[:overlap] = ramp
    w[-overlap:] = ramp.flip(0)
    return w[:, None] * w[None, :]   # outer product -> 2D

def blend_tile(canvas, weight, tile, win, y, x):
    # accumulate the windowed tile and its weights at offset (y, x)
    h, w = tile.shape[-2:]
    canvas[..., y:y+h, x:x+w] += tile * win
    weight[..., y:y+h, x:x+w] += win
    # caller does canvas / weight.clamp_min(1e-8) at the end

After switching to this, the seam difference dropped from 4.1 to 1.0, statistically indistinguishable from a normal column. Same model weights. Same GPU. Just honest about where each tile's information ends.
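For completeness, here is a minimal sketch of the full accumulate-and-normalize loop around a windowed blend. Several things are assumptions made for the illustration: the helper names ramp_window, tile_starts and run_tiled are hypothetical, the window skips the Hann endpoint that is exactly zero so that pixels on the outer image border keep a nonzero weight, and the "model" is an identity function at the same resolution, whereas a real upscaler would scale every coordinate by its factor.

```python
import torch

def ramp_window(size: int, overlap: int) -> torch.Tensor:
    # Hann ramps over the overlap, flat in the middle; the exact-zero
    # endpoint is skipped so border pixels never end up with zero weight.
    w = torch.ones(size)
    ramp = torch.hann_window(2 * overlap + 2, periodic=False)[1:overlap + 1]
    w[:overlap] = ramp
    w[-overlap:] = ramp.flip(0)
    return w[:, None] * w[None, :]

def tile_starts(length: int, tile: int, step: int) -> list:
    # Regular grid, plus one final tile flush with the far edge if needed
    starts = list(range(0, length - tile + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def run_tiled(image, model, tile=512, overlap=64):
    step = tile - overlap
    win = ramp_window(tile, overlap)
    canvas = torch.zeros_like(image)
    weight = torch.zeros_like(image)
    h, w = image.shape[-2:]
    for y in tile_starts(h, tile, step):
        for x in tile_starts(w, tile, step):
            out = model(image[..., y:y + tile, x:x + tile])
            canvas[..., y:y + tile, x:x + tile] += out * win
            weight[..., y:y + tile, x:x + tile] += win
    # Normalize by the summed weights so overlapping windows need not sum to 1
    return canvas / weight.clamp_min(1e-8)

# Sanity check with an identity "model": blending must reproduce the input
image = torch.rand(1, 3, 96, 96)
stitched = run_tiled(image, model=lambda t: t, tile=32, overlap=8)
max_err = (stitched - image).abs().max().item()
print(f"max reconstruction error: {max_err:.2e}")
```

The identity check is a useful unit test for any tiling code: if the stitch cannot reproduce an untouched image, the weights or offsets are wrong before the model is even involved.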

Catching it before customers do

The annoying part was that nobody noticed the seam for a while because our eval set was mostly 1024px crops that never tiled. The artifact only existed at the resolution we did not test.

So we built a regression check on full-size output. For each render we compute the per-column mean absolute gradient and flag any column whose value spikes above its neighbors by more than 3x at a known tile boundary. Cheap, deterministic, runs on CPU.
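A minimal sketch of that kind of check, using NumPy on a single-channel image. The function name flag_seams, the halo parameter that defines how many neighboring columns count as "neighbors", and the synthetic test image are assumptions for illustration; only the 3x ratio comes from the description above.

```python
import numpy as np

def flag_seams(gray, boundaries, ratio=3.0, halo=4):
    """Return the tile-boundary columns whose gradient spikes above their neighbors.

    gray: 2D array (H, W) of pixel intensities.
    boundaries: column indices x where a stitch falls between columns x-1 and x.
    """
    # Mean absolute horizontal gradient; entry i covers the step from column i to i+1
    col_grad = np.abs(np.diff(gray.astype(np.float64), axis=1)).mean(axis=0)
    flagged = []
    for x in boundaries:
        at_seam = col_grad[x - 1]
        lo, hi = max(0, x - 1 - halo), min(len(col_grad), x + halo)
        # Neighboring gradient values, with the seam position itself removed
        neighbors = np.delete(col_grad[lo:hi], x - 1 - lo)
        if at_seam > ratio * (neighbors.mean() + 1e-6):
            flagged.append(x)
    return flagged

# Synthetic image: gentle horizontal gradient plus noise
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(100, 120, 128), (64, 1)) + rng.normal(0, 0.3, (64, 128))

seamed = clean.copy()
seamed[:, 64:] += 4.0   # inject a 4-level step at a tile boundary

with_seam = flag_seams(seamed, boundaries=[64])
without_seam = flag_seams(clean, boundaries=[64])
print(with_seam, without_seam)
```

Checking only known boundary columns keeps the test deterministic and avoids flagging real edges in the image content.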

For the fuzzier cases (texture seams, slight color drift) we run a vision-language model over a sample of outputs and ask it to describe any visible discontinuity. Those calls go through a gateway, Bifrost, which is one of a few ways we keep provider config and rate limits in one place instead of scattered across scripts. The numeric check catches the obvious ones; the VLM catches the ones a metric misses.

Comparison

Strategy	Seam MAD (8-bit)	VRAM (4K)	Extra compute
Single pass	0	~31 GB (OOM on 24GB)	baseline
Hard tiles, no overlap	4.1	6 GB	none
Overlap + flat average	2.3	7 GB	+14%
Overlap + Hann window	1.0	7 GB	+16%

Trade-offs and Limitations

Overlap is not free. A 64px overlap on 512px tiles means roughly 16% more pixels get processed, so throughput drops by about that much. Wider overlap blends better and costs more, and past ~96px we saw no further quality gain, only the bill.

Hann windowing assumes the two predictions in the overlap are both reasonable and close. They usually are for this upscaler. For a diffusion model with stochastic sampling per tile they can diverge enough that blending produces a ghost, and you need a shared noise seed or latent-space tiling instead.

This also does nothing for semantic seams, where two tiles hallucinate different details. Window blending fixes geometry and color continuity, not content disagreement. That is a harder problem and the honest answer is you tile in latent space or you do not tile at all.

Perplexity held flat after INT4. Task accuracy dropped 7 points.

Marcus Chen — Fri, 19 Jun 2026 06:39:22 +0000

TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost shipped it. A domain eval suite caught a 7-point drop in multi-step task completion that perplexity never saw. Perplexity is a terrible acceptance gate for quantized models.

We run model fine-tuning and eval for enterprise agent automation at Nexus Labs. Series B, small team, ten people who touch the eval pipeline. The model in question was a Qwen2.5-14B fine-tune we use for structured workflow execution. Customer-facing. It matters when it's wrong.

The plan was boring. Quantize to INT4 to fit two replicas on one A100 instead of one, cut serving cost roughly in half. Standard move. We picked GPTQ with a 128 group size, ran calibration on 512 samples from our training distribution, and measured perplexity before and after.

The number that lied

Perplexity on our held-out set: 3.81 full precision, 3.85 after INT4. That's a 1% move. Nothing. By the old folklore, a quantization that holds perplexity is a quantization you ship.

So we ran the actual eval suite. Not perplexity. The 340-case adversarial set we built for this product, where each case is a multi-step task with a programmatic pass/fail check on the final state.

Task completion went from 81.2% to 74.1%. Seven points. On a metric customers feel directly.

The failures clustered. Long sequences, six steps or more, where the model had to hold a constraint from step one and apply it at step five. The INT4 model dropped the constraint. Perplexity averages token-level surprise across the whole corpus, so a few critical tokens going wrong in a 400-token trajectory barely move the mean. The eval that scores the trajectory outcome sees it immediately.
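A toy calculation shows how little the mean moves. The numbers below are made up for illustration and are not taken from our eval logs: a 400-token trajectory at a uniform 1.0 nat per token, in which two constraint-carrying tokens become much less likely.

```python
import math

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical 400-token trajectory, uniform 1.0 nat per token
baseline = [1.0] * 400

# Same trajectory, but two critical tokens get far less likely
degraded = list(baseline)
degraded[120] = 3.0
degraded[305] = 3.0

ppl_base = perplexity(baseline)
ppl_degraded = perplexity(degraded)
rel_change = (ppl_degraded - ppl_base) / ppl_base
print(f"{ppl_base:.3f} -> {ppl_degraded:.3f} ({rel_change:+.1%})")
```

Two tokens going badly wrong shift the mean by about one percent, while a pass/fail check on the final state would flip from pass to fail.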

Here is roughly what we measured across the gates:

Metric	FP16	INT4 (GPTQ)	Delta
Perplexity (held-out)	3.81	3.85	+0.04
MMLU (5-shot)	71.4%	70.9%	-0.5
Task completion (our suite)	81.2%	74.1%	-7.1
Constraint-retention subset	88%	69%	-19

MMLU barely moved either. Generic benchmarks were as blind as perplexity here. The damage was concentrated in exactly the capability our product depends on, and only the domain suite measured it.

Why averaged metrics miss this

Quantization error isn't uniform. INT4 rounds weights into buckets, and the layers that handle long-range dependency, attention projections deep in the stack, take the error worst. A model can stay fluent token-to-token while losing the thread across a long context. Fluency is what perplexity rewards. Following a constraint across 400 tokens is not fluency.

The lesson we keep relearning. The model is the easy part. The thing that tells you whether the model is good enough is the hard part, and it's almost never a single scalar.

What we changed

We made the domain suite a hard gate for any inference-level change. Quantization, a vLLM version bump, a new kernel, all of it has to clear the trajectory eval, not perplexity.

To get clean comparisons we shadow every eval case against two backends at once: the FP16 reference on one endpoint and the candidate INT4 build on another. We route both through Bifrost, our gateway, so the eval harness sends one OpenAI-format request and we fan it to both backends behind the same interface. That removed a class of bugs where prompt formatting drifted between the two test paths and made the diff look bigger than it was.

The harness itself is dull on purpose:

import asyncio

import httpx

GATEWAY = "http://localhost:8080/v1/chat/completions"

async def run_case(client, model, case):
    state = case.initial_state
    for _step in case.steps:
        r = await client.post(GATEWAY, json={
            "model": model,                 # "ref/qwen-fp16" or "cand/qwen-int4"
            "messages": case.render(state),
            "temperature": 0,
        })
        r.raise_for_status()               # fail loudly instead of scoring an error body
        state = case.apply(state, r.json())
    return case.check(state)               # programmatic pass/fail

async def eval_suite(model, cases):
    async with httpx.AsyncClient(timeout=60) as c:
        results = await asyncio.gather(*[run_case(c, model, x) for x in cases])
    return sum(results) / len(results)     # fraction of cases that passed

Temperature 0, deterministic check, no LLM judging the output. The check is code that inspects final state. When the pass criterion is itself fuzzy, you can't tell a quantization regression from judge noise, and we'd already been burned by that.

We didn't abandon INT4. We re-ran with AWQ instead of GPTQ and bumped calibration to 1,024 samples weighted toward long sequences. That landed at 79.3% task completion. Still down from FP16, but inside our 2-point tolerance, so we shipped it with the cost win mostly intact.

Trade-offs and limitations

A 340-case trajectory suite is expensive. Each full run takes about 11 minutes of real GPU time. Perplexity is seconds. We can only afford the suite because we gate on it for releases, not every commit.

This finding is ours, not a law. A model serving short single-turn responses would likely show almost no gap between perplexity and task metrics, because there's no long-range constraint to lose. The wider the gap between your token-level proxy and your actual product behavior, the more this bites.

Deterministic checks only work when success is checkable in code. Plenty of generation tasks aren't, and there you're stuck with judge models and their variance. We don't pretend INT4 is free either. It cost us 2 points we chose to pay for the throughput.

And calibration data matters more than the algorithm. Switching GPTQ to AWQ helped, but reweighting calibration toward long sequences helped more.

Speculative decoding shifted our output distribution and evals missed it

Marcus Chen — Thu, 18 Jun 2026 06:31:41 +0000

TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B. Got a 1.9x throughput win. Three weeks later a customer flagged that the agent's tool-call arguments had subtly changed. Greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path.

I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. The model we fine-tune is a Llama-3.1-8B variant that drives tool calls. Latency matters because each agent turn can chain 4 or 5 calls.

So we enabled speculative decoding. Draft model was a distilled 1B. Target was our 8B. The pitch is simple: the draft proposes tokens, the target verifies them in one forward pass, you accept the longest matching prefix. When acceptance is high you get tokens nearly for free.
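The accept-the-longest-matching-prefix rule can be sketched with toy models. Everything here is an assumption for illustration: the "models" are plain integer functions rather than networks, the names speculative_greedy, toy_target and toy_draft are hypothetical, and the verification loop stands in for what a real server does in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out

def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 5) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) The draft proposes k tokens autoregressively
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)        # draft agreed, token comes for free
            else:
                accepted.append(expected)   # first mismatch: take the target's token
                break
        else:
            accepted.append(target(out + accepted))  # all k accepted: bonus token
        out.extend(accepted)
    return out[:len(prompt) + max_new]

def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50

def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every fourth position
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50

plain = greedy(toy_target, [1, 2, 3], max_new=20)
spec = speculative_greedy(toy_target, toy_draft, [1, 2, 3], max_new=20)
print(plain == spec)
```

With exact integer arithmetic the two outputs are identical by construction, because every emitted token is the target's own choice for that position. The losslessness argument depends on the target producing the same answer in both paths.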

The throughput number was real. 1.9x at our batch sizes. The problem was everything we assumed about correctness.

"Lossless" is doing a lot of work in that sentence

The vLLM docs say speculative decoding is lossless for greedy. That is true in exact arithmetic. It is not true in float16 on a GPU.

Here is the thing nobody tells you. The verification step recomputes logits for the drafted tokens in a batched forward pass. The target model alone computes them token-by-token. Different batch shapes, different kernel paths, different reduction order. The argmax usually agrees. Usually.

When the top two logits are within a few thousandths of each other, the batched path and the sequential path can pick different tokens. For most text that is invisible. For structured tool-call output where one token flips "limit": 50 to "limit": 500, it is not invisible at all.
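The reduction-order effect does not need a GPU to reproduce. The sketch below uses ordinary Python floats (float64) and made-up values; in float16 the rounding steps are far coarser, so near-ties are much easier to hit.

```python
# Three partial sums that make up one logit (illustrative values)
parts = [0.1, 0.2, 0.3]

sequential = (parts[0] + parts[1]) + parts[2]   # one reduction order
batched = parts[0] + (parts[1] + parts[2])      # another order, same math on paper

rival = 0.6   # logit of a competing token, nearly tied

def argmax(logits):
    # ties resolve to the lowest index here
    return max(range(len(logits)), key=logits.__getitem__)

pick_sequential = argmax([rival, sequential])
pick_batched = argmax([rival, batched])
print(sequential == batched, pick_sequential, pick_batched)
```

The two sums differ in the last bit, which is enough to flip the argmax when the runner-up sits that close.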

We measured it. Ran the same 2,000 prompts through both paths, greedy, temperature 0.

Serving path	Exact-match outputs	Tool-arg mismatch	Tokens/sec
Target only (no spec)	baseline	0%	41
Spec decode, 1B draft	98.8%	1.2%	78
Spec decode, 3B draft	99.4%	0.6%	64

1.2% of outputs differed. On agent traffic that chains calls, a 1.2% per-call divergence compounds. Over a 5-call session that's roughly a 6% chance at least one call drifts.
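The compounding estimate is a one-liner, assuming each call drifts independently with the measured per-call rate (independence is an assumption; real sessions may correlate).

```python
per_call = 0.012   # measured per-call mismatch rate with the 1B draft
calls = 5          # calls chained in one agent session

# Probability that at least one of the calls drifts
p_any_drift = 1 - (1 - per_call) ** calls
print(f"{p_any_drift:.1%}")
```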

Why our evals slept through it

This is the part I'm actually annoyed about. Our offline eval suite hit the model directly through the HF generate() API. No speculative decoding. No batched verification. Our production serving stack ran vLLM with spec decode on.

We were evaluating one numerical path and shipping another. The eval harness was honest about the model it tested. It just wasn't testing the model we served.

The fix was boring and correct: evaluate against the exact serving endpoint. We route all eval traffic through the same gateway the app uses, so the eval client and the production client are indistinguishable to the backend. We use Bifrost in front of our vLLM and external providers, which gave us one OpenAI-compatible endpoint to point both at. The point isn't the tool. The point is your eval requests must traverse the identical decode path, kernels included.

Here's the config flag that matters in vLLM:

# vllm serving config
model: /models/nexus-8b-toolcall
speculative_config:
  model: /models/nexus-1b-draft
  num_speculative_tokens: 5
# this is the one we missed:
# disable_logprobs_during_spec_decoding defaults vary by version.
# pin it and assert it in CI.
speculative_disable_logprobs: false

And the eval-side assertion we added so this never ships silently again:

# fail CI if eval path != serving path
resp = client.chat.completions.create(
    model="nexus-8b-toolcall",
    messages=msgs,
    temperature=0,
    extra_body={"spec_decode": True},  # must match prod
)
assert resp.system_fingerprint == EXPECTED_FINGERPRINT, (
    f"decode path drift: {resp.system_fingerprint}"
)

We compute a fingerprint from the serving config (draft model hash, num_speculative_tokens, kernel version) and assert it. If someone bumps vLLM or swaps the draft, CI goes red before the eval numbers are trusted.
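One way such a fingerprint could be derived, as a sketch only: hash a canonical JSON encoding of the three fields named above. The function name, the field names, and the sample values are all assumptions for illustration, and how the value is surfaced in the response's `system_fingerprint` is not shown here.

```python
import hashlib
import json

def config_fingerprint(draft_model_sha: str,
                       num_speculative_tokens: int,
                       kernel_version: str) -> str:
    """Stable short hash of the serving settings that affect the decode path."""
    fields = {
        "draft_model_sha": draft_model_sha,
        "num_speculative_tokens": num_speculative_tokens,
        "kernel_version": kernel_version,
    }
    # sort_keys and fixed separators keep the encoding byte-for-byte stable.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Placeholder values, not our real hashes or versions.
baseline = config_fingerprint("abc123", 5, "kernels-v1")
bumped = config_fingerprint("abc123", 6, "kernels-v1")
print(baseline, bumped, baseline == bumped)
```

Any change to a hashed field produces a different fingerprint, so the CI assertion fails until someone updates the expected value on purpose.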

What we changed

We kept speculative decoding. The latency win was worth more than 1.2% drift for most of our endpoints. But we did three things.

First, we raised the bar on tool-call endpoints specifically. For the two customers running financial workflows, we run target-only, no draft. Slower, exact. They opted in to the cost.

Second, we started running a nightly divergence canary that replays 500 prompts through both serving paths and alerts if mismatch exceeds 1.5%. This caught a vLLM upgrade that shifted draft acceptance logic and pushed mismatch to 2.1%.
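The core of such a canary is a small comparison. The sketch below assumes the two lists hold the outputs for the same prompts in the same order; `mismatch_rate` and `should_alert` are hypothetical names, and the replay and alerting plumbing is left out.

```python
ALERT_THRESHOLD = 0.015  # alert when more than 1.5% of outputs differ

def mismatch_rate(target_only: list, spec_decode: list) -> float:
    """Fraction of prompts whose outputs differ between the two serving paths."""
    if len(target_only) != len(spec_decode):
        raise ValueError("both paths must be replayed on the same prompts")
    if not target_only:
        raise ValueError("no outputs to compare")
    differing = sum(a != b for a, b in zip(target_only, spec_decode))
    return differing / len(target_only)

def should_alert(rate: float, threshold: float = ALERT_THRESHOLD) -> bool:
    return rate > threshold

# Synthetic example: 500 prompts, 11 of which drift on the spec-decode path.
reference = [f"out-{i}" for i in range(500)]
replayed = list(reference)
for i in range(11):
    replayed[i] = "drifted"

rate = mismatch_rate(reference, replayed)
print(f"{rate:.1%}", should_alert(rate))
```

With 500 prompts, each differing output moves the rate by 0.2 points, so a threshold close to the baseline will be noisy; a larger replay set tightens that.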

Third, all eval traffic now routes through the production endpoint. No more generate() in the harness. If the serving path changes, the eval changes with it.

Trade-offs and Limitations

This costs you reproducibility. Pinning evals to the serving path means a kernel update can move your eval scores even when the weights are frozen. That is correct, but it means "the model regressed" and "the runtime changed" now look the same on the dashboard. You need the fingerprint to tell them apart.

The fingerprint approach is only as good as what you hash. We hash config, not the actual CUDA kernel binary. A driver update that changes reduction order without changing our config would slip through. The nightly canary is the backstop for that, not the assertion.

Target-only serving for the exact endpoints roughly halved throughput for those customers. We ate that. Bigger draft models shrink the gap but cost more memory and more time per drafted token: our 3B draft cut mismatch to 0.6% but dropped throughput from 78 to 64 tokens/sec, so 3B was not a free win either.

And 1.2% is our number, on our model, at our logit margins. A model with sharper output distributions will diverge less. One with flatter logits will diverge more. Measure your own.

Developer Take On: A High-Resolution Neural Cellular Automata

Kelvin Kariuki — Wed, 17 Jun 2026 11:53:56 +0000

Art has always been a fusion of creativity and mathematics, with each playing off the other to produce breathtaking works. With the advent of machine learning, the line between art and mathematics has blurred further, allowing us to generate visuals that were previously out of reach. Cellular automata, a mathematical concept first introduced by von Neumann in the 1940s, have been a staple of artificial life and pattern generation. In this article, we'll dive into high-resolution neural cellular automata, exploring the concept and its applications, and sketching an implementation in Python using the PyTorch library.

Cellular Automata 101

Before we plunge into the world of neural cellular automata, let's quickly cover the basics of cellular automata. In essence, a cellular automaton is a grid of identical cells, each of which can change its state based on a set of predefined rules. These rules are applied simultaneously to all cells, resulting in a global update of the grid in each time step. This process is repeated iteratively, generating a sequence of grids that represent the evolution of the system.

One of the most well-known examples of a cellular automaton is Conway's Game of Life, in which cells are either alive (1) or dead (0). The rules for updating the grid are as follows:

Any live cell with two or three live neighbors survives.
Any dead cell with three live neighbors becomes a live cell.
All other live cells die in the next generation. Similarly, all other dead cells stay dead.
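The three rules above can be written as a short update function. This is a minimal sketch in plain Python; it assumes a finite grid where cells outside the border count as dead, and `life_step` is a hypothetical name.

```python
def life_step(grid):
    """Apply one Game of Life update to a 2D list of 0/1 cells."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r, c):
        # Sum the up-to-8 surrounding cells, clipping at the grid border.
        return sum(
            grid[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))
            if (rr, cc) != (r, c)
        )

    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            n = live_neighbours(r, c)
            if grid[r][c] == 1:
                new_row.append(1 if n in (2, 3) else 0)  # survival
            else:
                new_row.append(1 if n == 3 else 0)       # birth
        new_grid.append(new_row)
    return new_grid

# A "blinker": a vertical bar that flips to horizontal and back.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in life_step(blinker):
    print(row)
```

Every cell's new state is computed from the old grid before anything is written, which is what "applied simultaneously" means in practice.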

The resulting patterns created by cellular automata can be stunningly beautiful and display complex behavior, making them an attractive field of study for scientists and artists alike.

What is Neural Cellular Automata?

Neural cellular automata (NCA) is an extension of traditional cellular automata, in which the rules governing the evolution of the grid are learned from a dataset using a neural network. This allows the NCA to automatically discover complex patterns and relationships in the data, resulting in visually striking and often surreal images.

In essence, the NCA uses a neural network to predict the next state of each cell in the grid based on its current state and the states of its neighboring cells. This prediction is then used to update the grid, resulting in a sequence of grids that represent the evolution of the system.

High-Resolution Neural Cellular Automata

The primary challenge in generating high-resolution NCA images lies in training a deep neural network to accurately predict the next state of each cell in the grid. As the resolution of the grid increases, the number of possible states and transitions between them grows exponentially, making it increasingly difficult for the network to generalize and apply the learned rules.

To overcome this challenge, we'll employ a technique called "pixel shuffle" (also known as sub-pixel convolution). The network runs its convolutions on a lower-resolution grid but emits r*r output channels for every channel of the final image; pixel shuffle then rearranges those channels into a grid that is r times larger in each spatial dimension. The convolutions stay cheap because they operate at the lower resolution, and the high-resolution output is learned by the network instead of being produced by a fixed upsampling filter.

Implementing High-Resolution NCA in PyTorch

Below is a simplified example of how we can implement a high-resolution NCA using the PyTorch library. We'll use a simple 3x3 convolutional neural network to learn the rules governing the evolution of the grid, and apply the pixel shuffle technique to generate high-resolution images.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

UPSCALE = 2  # the output grid is UPSCALE times larger than the input grid

# Define the PyTorch model
class NCA(nn.Module):
    def __init__(self, channels=3, hidden=64, upscale=UPSCALE):
        super().__init__()
        # 3x3 convolutions: each cell only sees its immediate neighbours
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        # Emit upscale**2 values per output channel for pixel shuffle to rearrange
        self.conv2 = nn.Conv2d(hidden, channels * upscale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.conv2(x)
        # (N, C*r*r, H, W) -> (N, C, H*r, W*r)
        return self.shuffle(x)

# Define the dataset
class NCADataset(Dataset):
    def __init__(self, data, target):
        self.data = data
        self.target = target

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

# Initialize the model, optimizer, and training data
model = NCA()
optimizer = optim.Adam(model.parameters(), lr=1e-5)
# Random placeholder tensors; replace with real low-res/high-res grid pairs
data = torch.randn(100, 3, 64, 64)
target = torch.randn(100, 3, 64 * UPSCALE, 64 * UPSCALE)

# Create the training dataset and data loader
dataset = NCADataset(data, target)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Train the model
for epoch in range(100):
    for batch, batch_target in data_loader:
        output = model(batch)
        loss = torch.mean((output - batch_target) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Reports the loss of the last batch in the epoch
    print(f'Epoch {epoch+1}, loss: {loss.item()}')

# Generate a high-resolution grid using the trained model
model.eval()
with torch.no_grad():
    seed = torch.randn(1, 3, 64, 64)
    image = model(seed)  # shape: (1, 3, 128, 128)

This is a simplified example, and in practice, you may need to adjust the architecture of the model and the training parameters to suit your specific use case.

Conclusion

In this article, we explored the concept of neural cellular automata and sketched a high-resolution NCA in PyTorch using the pixel shuffle technique. The example trains on random placeholder tensors, so it demonstrates the pipeline and not a finished image generator; producing striking images requires real target data and, as with any cellular automaton, applying the learned update rule over many steps. This is an active area of research, with potential applications from art to scientific visualization.



Winograd convolutions cost us 2 mAP and we didn't notice for a month

Marco Rinaldi — Wed, 17 Jun 2026 07:22:23 +0000

TL;DR: We turned on Winograd convolution to shave latency off a pedestrian detector running on a Cortex-A53, got a clean 18% speedup, and silently lost 2.1 mAP because the F(4,3) transforms lost precision in fp16. The accuracy drop hid inside our aggregate metric for almost a month before a per-object-size breakdown caught it.

So, the thing is, Winograd convolution is one of those optimisations that looks free. You replace the direct 3x3 convolution with a set of input transforms, elementwise multiplies, and an output transform, and the arithmetic count drops. For F(4,3), the standard tiling, each 4x4 output tile needs 36 multiplies instead of the 144 a direct 3x3 convolution uses. On paper that's a 4x reduction in multiplies for your 3x3 layers, before you count the cost of the transforms, and 3x3 is most of a modern backbone.
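The multiply counts follow from a simple formula: a 2D Winograd tile F(m x m, r x r) needs (m + r - 1)^2 elementwise multiplies, while direct convolution needs m^2 * r^2 for the same m x m outputs. A small sketch that tabulates the two common tile sizes for 3x3 filters (the function names are made up for illustration, and the counts are per input/output channel pair, excluding the transforms themselves):

```python
def winograd_multiplies(m: int, r: int) -> int:
    """Elementwise multiplies for one m x m output tile with F(m x m, r x r)."""
    return (m + r - 1) ** 2

def direct_multiplies(m: int, r: int) -> int:
    """Multiplies for the same m x m outputs with direct r x r convolution."""
    return (m * m) * (r * r)

for m in (2, 4):
    w = winograd_multiplies(m, 3)
    d = direct_multiplies(m, 3)
    print(f"F({m},3): direct={d} winograd={w} reduction={d / w:.2f}x")
```

F(2,3) comes out at 36 versus 16 (2.25x) and F(4,3) at 144 versus 36 (4x). The larger tile saves more arithmetic, and that saving is paid for with larger transform coefficients.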

We run a small detector on a Cortex-A53 board for an indoor people-counting product, MobileNetV3 backbone, roughly 4.2M params after pruning. The team is three CV engineers and one firmware person. We had a 41ms inference budget and were sitting at 39ms, which is the kind of margin that keeps you up at night.

What we turned on

Our runtime exposes Winograd as a per-layer flag. We flipped it on for every 3x3 stride-1 layer, rebuilt, and measured.

# before
./bench --model det_v3.onnx --conv-algo direct
# mean 39.1ms  p99 44.0ms

# after
./bench --model det_v3.onnx --conv-algo winograd-f43 --precision fp16
# mean 32.0ms  p99 35.8ms

18% off the mean, p99 comfortably under budget. We shipped it. Espresso, done, on to the next ticket.

Where it went wrong

The detector's overall mAP on our validation set moved from 0.612 to 0.608. Four thousandths. That's inside the noise we normally see between training runs, so nobody blinked. We pin our eval against a fixed 3,800-image set and a 0.004 wobble is genuinely not signal most days.

The problem only showed up when a customer reported that the counter undercounted in a large open atrium. People far from the camera, small in the frame, were getting dropped. When we broke mAP down by object size instead of looking at the single number, the picture was ugly.

Object size (px)	mAP direct	mAP Winograd fp16	delta
large (>96)	0.781	0.779	-0.002
medium (32-96)	0.644	0.631	-0.013
small (<32)	0.402	0.331	-0.071

Small objects lost 7 points. They're a minority of the boxes, so the aggregate barely moved, but for a people counter in a big room they're the whole game.

Why Winograd ate the small boxes

The F(4,3) transform matrices do not have tidy entries. With the commonly used interpolation points they mix integers as large as 8 with fractions as small as 1/24, and the intermediate accumulations span a wider dynamic range than a direct convolution does. In fp32 this is fine. In fp16, with a 10-bit mantissa, the transform amplifies low-magnitude activations and then the inverse transform has to subtract them back out. Catastrophic cancellation. The features that survive are the high-contrast ones, which correspond to large, well-lit objects. The faint gradient that says "small person at the back of the room" gets rounded into mush.
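A toy illustration of that precision loss, not the actual Winograd transform: a value rides on top of a large intermediate term that is later subtracted back out, with rounding to float16 after each step. The numbers are invented stand-ins, and `to_fp16` and `add_then_subtract` are hypothetical helpers that emulate float16 with Python's `struct` module.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def add_then_subtract(signal: float, offset: float, rounder) -> float:
    """Add a large offset, then subtract it again, rounding after each step."""
    boosted = rounder(rounder(signal) + rounder(offset))
    return rounder(boosted - rounder(offset))

faint = 0.01   # stand-in for a weak small-object activation
strong = 0.5   # stand-in for a high-contrast activation
offset = 64.0  # stand-in for a large intermediate term

print("fp16 faint: ", add_then_subtract(faint, offset, to_fp16))
print("fp16 strong:", add_then_subtract(strong, offset, to_fp16))
print("fp64 faint: ", add_then_subtract(faint, offset, float))
```

Near 64 the spacing between adjacent float16 values is 0.0625, so the faint signal is rounded away before the subtraction and comes back as exactly 0.0, while the strong one survives intact. In double precision the faint value is recovered almost exactly.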

We confirmed it by running the exact same weights with Winograd in fp32. Small-object mAP came back to 0.398, basically the direct number. The algorithm wasn't wrong. The algorithm in half precision was wrong for our data.

What we actually did

We did not throw Winograd away. We made it selective. The early layers, where the spatial resolution is high and small-object information lives, stayed on direct fp16. The deeper layers, lower resolution and more channels, kept Winograd. That recovered most of the speed without the accuracy hole.

conv_policy:
  default: winograd-f43
  precision: fp16
  overrides:
    # high-res early stages carry small-object signal
    - layers: ["stem", "stage1.*", "stage2.0"]
      algo: direct

End result: 34.6ms mean, small-object mAP at 0.395. We gave back about 2.6ms versus full Winograd and bought back 6.4 points where it mattered.
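To make the override semantics concrete, here is a sketch of how a policy like the one above could be resolved per layer. It assumes shell-style glob matching and first-match-wins, which may not be how your runtime interprets the patterns; `conv_algo_for` and the layer names in the loop are hypothetical.

```python
from fnmatch import fnmatchcase

# Mirrors the YAML policy above as a plain dict.
POLICY = {
    "default": "winograd-f43",
    "overrides": [
        {"layers": ["stem", "stage1.*", "stage2.0"], "algo": "direct"},
    ],
}

def conv_algo_for(layer_name: str, policy: dict = POLICY) -> str:
    """Return the first matching override's algorithm, else the default."""
    for rule in policy["overrides"]:
        if any(fnmatchcase(layer_name, pattern) for pattern in rule["layers"]):
            return rule["algo"]
    return policy["default"]

for name in ("stem", "stage1.3", "stage2.0", "stage2.1", "stage4.2"):
    print(name, "->", conv_algo_for(name))
```

A check like this is cheap to run in CI against the real layer list, so a renamed stage cannot silently fall back to the default.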

One side note on validation. To trust the size-bucketed numbers we needed clean ground truth on a fresh holdout, and hand-labelling small distant figures is miserable and inconsistent between annotators. We auto-labelled a 600-image holdout with a VLM and had humans only correct it, routing those calls through Bifrost so we could fail over between two providers when one rate-limited us mid-batch. It was one option among a few; the point is the labels were consistent enough to make the per-bucket deltas believable.

Trade-offs and Limitations

This is not a "Winograd bad" post. F(4,3) in fp16 is a perfectly good default for a lot of models, and for a classifier where you only care about top-1 it would probably have been invisible and harmless.

The fix is model- and data-specific. Our small-object sensitivity is what made the fp16 cancellation matter. Your failure mode might be somewhere else entirely.
Selective per-layer policy adds config surface. Someone has to remember why stage1 is direct, and that comment in the YAML is the only thing standing between you and a future regression.
We never tried Winograd F(2,3), which has tamer transform coefficients and less numerical risk, at the cost of a smaller MAC reduction. That's the next thing to benchmark.
The real lesson is about the metric, not the kernel. A single aggregate number hid a 7-point hole for weeks. Bucket your eval by the dimension your product actually cares about.

