Yaroslav Pristupa

Why your GPU reports 75°C while your VRAM is cooking at 105°C – the telemetry gap that kills LLM inference

You've set up a local LLM inference node. The model loads. The first tokens stream in at 20 t/s. Everything looks perfect in Task Manager: GPU utilization at 95%, core temperature at 75°C, fan speed humming along. You walk away for a coffee.

When you return twenty minutes later, the token rate has cratered to 5 t/s. Task Manager still shows 75°C. The GPU utilization is still at 95%. There are no error messages, no crashes, no obvious software failures. The system appears healthy. It isn't.

The problem is a telemetry blind spot in the default monitoring stack. Task Manager and many other monitoring tools report only the GPU core temperature. They don't report the memory junction temperature, the thermal reading that determines whether your GDDR6X VRAM modules can sustain high-bandwidth read/write operations. And when you're running a Mixture of Experts model through llama.cpp's -cmoe flag, that memory junction temperature is the number that matters most.

This article breaks down the mechanics of the -cmoe memory split, explains why LLM inference creates a sustained thermal load that gaming never does, and shows you how to query the real temperature delta using Python and the NVIDIA Management Library (NVML). We'll also look at why standard OS monitoring tools are structurally incapable of showing you the data you need to keep your inference nodes stable.

If you're building local AI pipelines on consumer hardware, this is the article that explains why they keep degrading without obvious cause.

The -cmoe flag: what it actually does

When you pass -cmoe to llama.cpp, you're telling the engine to exploit the Mixture of Experts architecture for memory efficiency. Here's what happens under the hood.

Gemma-4 26B is a MoE model with 128 expert sub-networks. At inference time, only 8 experts activate per token. The router network selects which experts handle each input, and the rest stay dormant. This means the model's "active" parameter count is 3.8B, not 26B. The full 26B parameters sit in memory, but you're only touching a fraction of them on each forward pass.
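
To make the "fraction" concrete, here is a small arithmetic sketch using only the figures quoted above. The variable names are illustrative.

```python
# Figures quoted in the paragraph above.
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 8
TOTAL_PARAMS_B = 26.0   # billions of parameters held in memory
ACTIVE_PARAMS_B = 3.8   # billions of parameters used per token

# Share of experts the router selects for each token.
expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS
# Share of all parameters touched on each forward pass.
param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"experts active per token:     {expert_fraction:.2%}")
print(f"parameters touched per token: {param_fraction:.1%}")
```

The parameter share comes out larger than the expert share because the non-expert layers, such as attention, run for every token regardless of which experts the router picks.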

The -cmoe flag splits this memory footprint across two physical locations:

┌─────────────────────────────────────────────────────────────┐
│                 -cmoe MEMORY ALLOCATION                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  SYSTEM RAM (DDR5)                    GPU VRAM (8GB GDDR6X) │
│  ┌─────────────────────┐              ┌────────────────────┐ │
│  │ Expert Weights      │              │ Attention Layers   │ │
│  │ (120 of 128 experts)│              │ (Q, K, V, O)       │ │
│  │ ~11.5 GB            │              │ ~1.2 GB            │ │
│  │                     │              │                    │ │
│  │ Swapped on-demand   │◄────────────►│ Always resident    │ │
│  │ by router network   │  PCIe 4.0    │                    │ │
│  │                     │  ~16 GB/s    │ KV Cache           │ │
│  │                     │              │ ~0.5 GB            │ │
│  └─────────────────────┘              └────────────────────┘ │
│                                                             │
│  Token Generation: 20 t/s sustained                        │
│  Expert Swap Latency: <2ms per token                       │
└─────────────────────────────────────────────────────────────┘

The attention mechanism runs on every token. It's compute-bound and latency-sensitive. Keeping it in VRAM ensures consistent token generation speed. The expert weights, on the other hand, are memory-bound. They tolerate the PCIe transfer penalty because only 8 of 128 experts need to move per token.

The tradeoff is bandwidth. Every token generation cycle involves:

  1. Loading attention weights from VRAM (sustained read)
  2. Swapping 8 expert weights from system RAM across PCIe (burst write)
  3. Computing the forward pass (GPU compute)
  4. Updating the KV cache in VRAM (sustained write)

This creates a continuous read/write pattern on the VRAM that never rests. And that's where the thermal problem begins.

The constant-write nightmare

Gaming and LLM inference have fundamentally different memory access patterns. This distinction is the root cause of VRAM thermal saturation.

When you play a game, the GPU workload is bursty. The render pipeline fills a frame buffer, swaps to display, and then pauses while the next frame is prepared. The memory bus gets micro-breaks between frames. At 60 FPS, each frame has a budget of about 16.7 milliseconds, and whatever part of that budget the GPU doesn't need for rendering is idle time. The VRAM modules have time to dissipate heat between write operations.

LLM inference doesn't work this way. Every token generation cycle involves sustained, high-frequency read/write operations on the VRAM. There are no frame boundaries, no vsync pauses, no natural break points. The memory bus runs at 100% utilization continuously.

Consider a 60,000 token context window. The KV cache alone consumes hundreds of megabytes of VRAM. Every new token requires:

  • Reading the entire KV cache from VRAM (sustained read)
  • Appending the new token's keys and values to the KV cache in VRAM (a write on every token)
  • Reading attention weights (sustained read)
  • Writing expert weight buffers during -cmoe swaps (burst write)

This creates a thermal load that the memory modules were never designed for. GDDR6X chips are optimized for bursty workloads like gaming and 3D rendering. Sustained 100% memory bus utilization generates heat faster than the laptop's shared heat-pipes can dissipate it.

The math is simple. At 20 tokens per second, each token takes 50 milliseconds. During those 50 milliseconds, the VRAM is under constant read/write load. The memory junction temperature rises. At 60 tokens per second (a realistic rate for smaller models), the load is even more intense. The heat accumulates faster than the cooling system can remove it.
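
The per-token time budget in that paragraph can be checked with a one-line helper. This is a minimal sketch, and `interval_ms` is a name chosen here for illustration.

```python
def interval_ms(rate_per_second: float) -> float:
    """Time available for one unit of work (a token or a frame), in milliseconds."""
    return 1000.0 / rate_per_second


for rate in (20, 60):
    print(f"{rate} t/s -> {interval_ms(rate):.1f} ms per token")
```

At 60 t/s the per-token interval is the same length as a 60 FPS frame budget. The difference described above is that a game can leave part of that interval idle, while token generation keeps the memory bus busy for all of it.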

After 15-20 minutes, the memory junction hits 105°C. The GPU firmware triggers an emergency thermal protocol. Clock speeds drop 40%. Your 20 t/s token rate becomes 5 t/s. Task Manager still shows 75°C on the GPU core. The VRAM is cooking, and you can't see it.
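
A quick sanity check on those two figures, assuming (as a simplification) that token rate scales linearly with clock speed. The function name is hypothetical.

```python
def linear_throttle_estimate(base_tps: float, clock_reduction: float) -> float:
    """Token rate expected if throughput scaled 1:1 with clock speed."""
    return base_tps * (1.0 - clock_reduction)


base_tps = 20.0      # rate before throttling, from the scenario above
observed_tps = 5.0   # rate after throttling, from the scenario above

estimate = linear_throttle_estimate(base_tps, 0.40)
observed_drop = 1.0 - observed_tps / base_tps

print(f"linear estimate after a 40% clock cut: {estimate:.1f} t/s")
print(f"observed drop in token rate: {observed_drop:.0%}")
```

Under that assumption a 40% clock cut would leave about 12 t/s. Falling to 5 t/s is a 75% loss, so the clock reduction on its own doesn't account for the whole collapse.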

The Windows telemetry gap

Windows Task Manager exposes GPU metrics through the Windows Management Instrumentation (WMI) interface. The problem is structural: WMI's GPU provider only surfaces the GPU core temperature sensor. It doesn't expose the memory junction temperature sensor, even though the hardware provides it.

This isn't a bug. It's a design limitation. The WMI GPU provider was built for gaming and 3D rendering workloads, where GPU core temperature is the relevant metric. When gaming, the memory junction temperature stays well below throttling limits because the workload is bursty. Microsoft never needed to expose it.

For LLM inference, this creates a critical blind spot. You're monitoring the wrong sensor. The GPU core might sit at 75°C (well within spec) while the memory junction climbs to 105°C (thermal emergency). You have no visibility into the actual bottleneck.

The fix requires bypassing WMI entirely. The NVIDIA Management Library (NVML) provides direct access to the GPU's own sensor data, including a memory temperature reading on GPUs and drivers that report it. You can query it from Python using ctypes.

Here's a minimal example that reads the real thermal state:

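# nvml_temperature_monitor.py
# Reads GPU core and VRAM temperature directly via NVML
# Bypasses Windows WMI limitations

import ctypes
import time
from ctypes import (POINTER, Structure, Union, byref, c_double, c_int,
                    c_longlong, c_uint, c_ulonglong, c_void_p)

# Load NVML library (installed with the NVIDIA driver)
nvml = ctypes.CDLL("nvml.dll")

# NVML constants (from nvml.h)
NVML_SUCCESS = 0
NVML_TEMPERATURE_GPU = 0       # GPU die sensor for nvmlDeviceGetTemperature
NVML_FI_DEV_MEMORY_TEMP = 82   # Memory temperature, read as a field value
NVML_VALUE_TYPE_DOUBLE = 0

class NvmlValue(Union):
    """Mirror of nvmlValue_t (an 8-byte union)"""
    _fields_ = [("dVal", c_double), ("uiVal", c_uint),
                ("ullVal", c_ulonglong), ("sllVal", c_longlong)]

class NvmlFieldValue(Structure):
    """Mirror of nvmlFieldValue_t"""
    _fields_ = [("fieldId", c_uint), ("scopeId", c_uint),
                ("timestamp", c_longlong), ("latencyUsec", c_longlong),
                ("valueType", c_int), ("nvmlReturn", c_int),
                ("value", NvmlValue)]

# Declare signatures so 64-bit device handles are not truncated
nvml.nvmlDeviceGetHandleByIndex_v2.argtypes = [c_uint, POINTER(c_void_p)]
nvml.nvmlDeviceGetTemperature.argtypes = [c_void_p, c_int, POINTER(c_uint)]
nvml.nvmlDeviceGetFieldValues.argtypes = [c_void_p, c_int, POINTER(NvmlFieldValue)]

def check(result, name):
    """Raise if an NVML call did not succeed"""
    if result != NVML_SUCCESS:
        raise RuntimeError(f"{name} failed: {result}")

def init_nvml():
    """Initialize NVML library"""
    check(nvml.nvmlInit_v2(), "nvmlInit_v2")

def get_gpu_count():
    """Get number of GPUs"""
    count = c_uint()
    check(nvml.nvmlDeviceGetCount_v2(byref(count)), "nvmlDeviceGetCount_v2")
    return count.value

def get_handle(device_index):
    """Get the device handle (nvmlDevice_t is an opaque pointer, not an integer)"""
    device = c_void_p()
    check(nvml.nvmlDeviceGetHandleByIndex_v2(device_index, byref(device)),
          "nvmlDeviceGetHandleByIndex_v2")
    return device

def get_core_temperature(device):
    """Read the GPU core (die) temperature in °C"""
    temp = c_uint()
    check(nvml.nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, byref(temp)),
          "nvmlDeviceGetTemperature")
    return temp.value

def get_memory_temperature(device):
    """Read the memory temperature in °C, or None if it is not reported.

    Not every GPU or driver reports this field; unsupported devices
    return a per-field error instead of a value.
    """
    field = NvmlFieldValue(fieldId=NVML_FI_DEV_MEMORY_TEMP)
    check(nvml.nvmlDeviceGetFieldValues(device, 1, byref(field)),
          "nvmlDeviceGetFieldValues")
    if field.nvmlReturn != NVML_SUCCESS:
        return None
    if field.valueType == NVML_VALUE_TYPE_DOUBLE:
        return int(field.value.dVal)
    # Integer value types: the low 32 bits hold the reading on little-endian hosts
    return field.value.uiVal

def monitor_thermal_delta(interval=1.0, duration=60):
    """Monitor GPU core vs VRAM junction temperature delta"""
    init_nvml()
    try:
        gpu_count = get_gpu_count()

        print(f"Monitoring {gpu_count} GPU(s) for {duration}s")
        print(f"{'Time':<8} {'GPU Core':<10} {'VRAM Junction':<15} {'Delta':<8}")
        print("-" * 45)

        start = time.time()
        while time.time() - start < duration:
            for i in range(gpu_count):
                try:
                    device = get_handle(i)
                    core_temp = get_core_temperature(device)
                    vram_temp = get_memory_temperature(device)
                    timestamp = time.strftime("%H:%M:%S")

                    if vram_temp is None:
                        print(f"{timestamp:<8} {f'{core_temp}°C':<10} {'n/a':<15} n/a")
                        continue

                    delta = vram_temp - core_temp
                    print(f"{timestamp:<8} {f'{core_temp}°C':<10} {f'{vram_temp}°C':<15} {delta:+d}°C")

                    if vram_temp > 95:
                        print(f"  WARNING: VRAM junction at {vram_temp}°C - throttling imminent!")
                except RuntimeError as e:
                    print(f"  GPU {i}: {e}")

            time.sleep(interval)
    finally:
        # Always release NVML, even if monitoring is interrupted
        nvml.nvmlShutdown()

if __name__ == "__main__":
    monitor_thermal_delta(interval=2.0, duration=30)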

The output reveals the telemetry gap that Task Manager hides:

Monitoring 1 GPU(s) for 30s
Time     GPU Core   VRAM Junction  Delta
---------------------------------------------
14:32:01 75°C       92°C           +17°C
14:32:03 75°C       94°C           +19°C
14:32:05 74°C       96°C           +22°C
  WARNING: VRAM junction at 96°C - throttling imminent!
14:32:07 75°C       98°C           +23°C
  WARNING: VRAM junction at 98°C - throttling imminent!
14:32:09 74°C       101°C          +27°C
  WARNING: VRAM junction at 101°C - throttling imminent!
14:32:11 73°C       103°C          +30°C
  WARNING: VRAM junction at 103°C - throttling imminent!
14:32:13 72°C       105°C          +33°C
  WARNING: VRAM junction at 105°C - throttling imminent!

The GPU core hovers around 75°C. The VRAM junction climbs to 105°C. That delta of 30°C or more is the gap between "system appears healthy" and "thermal emergency protocol triggered." Without NVML, you'd never see it.

Verifying through NVML

The Python ctypes approach works, but it's verbose and error-prone. For production deployments, consider using the pynvml package, which provides a cleaner wrapper around NVML:

# nvml_production_monitor.py
# Production-grade VRAM thermal monitoring with pynvml
# (the pynvml module is provided by NVIDIA's nvidia-ml-py package)

import logging
import time

import pynvml

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger(__name__)

class ThermalMonitor:
    def __init__(self, warning_threshold=95, critical_threshold=105):
        pynvml.nvmlInit()
        self.warning_threshold = warning_threshold
        self.critical_threshold = critical_threshold
        self.device_count = pynvml.nvmlDeviceGetCount()

    def get_thermal_state(self, device_index=0):
        """Get complete thermal state for a GPU"""
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

        # GPU core temperature
        core_temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

        # Memory temperature (the one Task Manager hides). NVML exposes it
        # as a field value; GPUs or drivers that don't report it return a
        # per-field error, which is treated here as "unknown".
        vram_temp = None
        try:
            field = pynvml.nvmlDeviceGetFieldValues(
                handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP])[0]
            if field.nvmlReturn == pynvml.NVML_SUCCESS:
                if field.valueType == pynvml.NVML_VALUE_TYPE_DOUBLE:
                    vram_temp = int(field.value.dVal)
                else:
                    vram_temp = field.value.uiVal
        except pynvml.NVMLError:
            # Some GPUs don't expose this sensor
            vram_temp = None

        # GPU utilization
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)

        # Memory usage
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

        return {
            'core_temp': core_temp,
            'vram_temp': vram_temp,
            'gpu_util': util.gpu,
            'mem_util': util.memory,
            'mem_used_gb': mem.used / (1024**3),
            'mem_total_gb': mem.total / (1024**3)
        }

    def check_thermal_health(self, state):
        """Evaluate thermal health and return status"""
        if state['vram_temp'] is None:
            return 'UNKNOWN', 'VRAM sensor not available'

        delta = state['vram_temp'] - state['core_temp']

        if state['vram_temp'] >= self.critical_threshold:
            return 'CRITICAL', f'VRAM at {state["vram_temp"]}°C - throttling active'
        elif state['vram_temp'] >= self.warning_threshold:
            return 'WARNING', f'VRAM at {state["vram_temp"]}°C - approaching limit'
        elif delta > 25:
            return 'WATCH', f'VRAM delta {delta}°C above core - monitor closely'
        else:
            return 'HEALTHY', f'VRAM at {state["vram_temp"]}°C - nominal'

    def monitor_loop(self, interval=2.0, duration=None):
        """Continuous monitoring loop (runs until interrupted if duration is None)"""
        start = time.time()

        while duration is None or time.time() - start < duration:
            for i in range(self.device_count):
                state = self.get_thermal_state(i)
                status, message = self.check_thermal_health(state)

                # Avoid logging "None°C" when the sensor is unavailable
                vram = 'n/a' if state['vram_temp'] is None else f"{state['vram_temp']}°C"
                logger.info(
                    f"GPU {i}: core={state['core_temp']}°C "
                    f"vram={vram} "
                    f"util={state['gpu_util']}% "
                    f"status={status}"
                )

                if status in ('CRITICAL', 'WARNING'):
                    logger.warning(f"GPU {i}: {message}")

            time.sleep(interval)

    def close(self):
        """Release NVML explicitly instead of relying on garbage collection"""
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass

if __name__ == "__main__":
    monitor = ThermalMonitor(warning_threshold=95, critical_threshold=105)
    try:
        monitor.monitor_loop(interval=2.0)
    except KeyboardInterrupt:
        pass
    finally:
        monitor.close()

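This production-grade monitor exposes the telemetry gap that standard tools hide. The key insight: you need the memory temperature reading, not just NVML_TEMPERATURE_GPU, which reads the GPU core die. In the public NVML headers the memory reading is a field value (NVML_FI_DEV_MEMORY_TEMP, read with nvmlDeviceGetFieldValues) rather than a second nvmlDeviceGetTemperature sensor type, and not every GPU or driver reports it, so treat a missing value as unknown rather than healthy.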

The thermal state dictionary gives you everything you need to make informed decisions about your inference workload. If the VRAM junction temperature exceeds 95°C, you're approaching the throttling zone. At 105°C, the firmware takes control and clamps your performance.

For long-running inference nodes, integrate this monitoring into your deployment pipeline. Log the thermal delta over time. If you see a consistent 25°C+ gap between core and VRAM junction, your cooling solution isn't designed for sustained AI workloads. You need either better hardware cooling or software-defined thermal management.
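
One way to turn "a consistent 25°C+ gap" into a check over logged samples is sketched below. The function name and the `min_fraction` cutoff (how many samples must exceed the gap before it counts as consistent) are assumptions made for illustration, not values from any tool.

```python
from typing import Sequence, Tuple


def sustained_gap(
    samples: Sequence[Tuple[float, float]],
    gap_c: float = 25.0,
    min_fraction: float = 0.8,  # assumed cutoff for "consistent"
) -> bool:
    """Return True if most (core_temp, vram_temp) samples show a gap of at least gap_c."""
    if not samples:
        return False
    over = sum(1 for core, vram in samples if vram - core >= gap_c)
    return over / len(samples) >= min_fraction


# (core, vram) pairs taken from the sample output shown earlier.
recent = [(75, 92), (75, 94), (74, 96), (75, 98), (74, 101), (73, 103), (72, 105)]
print(sustained_gap(recent))
```

In that short window only the last three samples exceed 25°C, so the check doesn't fire yet. Over a longer log, where the gap stays open, it would.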

The thermal saturation mechanism

GDDR6X memory modules have a specific thermal behavior that explains why the memory junction temperature diverges from the GPU core temperature during sustained workloads.

The memory junction sensor reports the temperature of the silicon inside the VRAM chips (the semiconductor junction), not the surface of the package or the PCB. This is the hottest part of the memory subsystem. During bursty workloads (gaming), the junction temperature stays close to the GPU core temperature because the heat dissipates during idle periods. During sustained workloads (LLM inference), the junction temperature climbs independently because there are no idle periods.

The thermal path looks like this:

VRAM chips (heat source)
    │
    ▼
Thermal pads (thermal interface)
    │
    ▼
Heat-pipe assembly (shared with GPU core)
    │
    ▼
Heatsink fins (air cooling)
    │
    ▼
Exhaust air

The problem is in the shared heat-pipe assembly. When the GPU core generates heat, the heat-pipes carry it to the fins. When the VRAM generates heat simultaneously, the heat-pipes are already carrying GPU heat. The thermal capacity of the shared assembly is exceeded. Heat accumulates at the memory junction faster than the heat-pipes can transport it.

The GDDR6X thermal emergency protocol triggers at 105°C. This isn't a software limit; it's a threshold enforced by the GPU's firmware. The GPU's internal controller reads the memory junction sensor and, when it exceeds 105°C, clamps clock speeds to prevent permanent hardware damage. The clamping is aggressive: a 40% clock speed reduction, and in the scenario above the token rate falls even further than that, from 20 t/s to 5 t/s.

The firmware doesn't care that your GPU core is "cool enough." It reads the memory junction sensor and acts on that data. The core temperature is irrelevant to this decision.

This is why standard monitoring tools create a false sense of security. They show you the core temperature, which stays within spec. They don't show you the junction temperature, which is what actually determines performance. You're flying blind.

Implications for production deployments

If you're running local AI inference nodes in production, thermal management isn't optional. It's a system design requirement.

The standard approach – load the model, monitor GPU utilization, hope for the best – fails after 15-20 minutes. The telemetry gap means you can't see the problem until performance collapses. By then, your inference pipeline is degraded and your users are frustrated.

Production-grade thermal management requires two things:

1. Direct sensor access. Bypass WMI. Query NVML or LibreHardwareMonitor directly. Log the memory junction temperature over time. Set alerts at 95°C (warning) and 105°C (critical).

2. Software-defined duty cycles. Instead of relying on hardware fans to manage thermal load, control the compute stream itself. Introduce millisecond-level pauses that let the VRAM modules cool before they hit the firmware threshold.

This is the approach VRAM Shield takes. Its Pulse Throttling technology introduces controlled pauses in the compute stream:

Without thermal management:
████████████████████████████████████████████████████████
  Continuous VRAM load → 105°C → 5 t/s (throttled)

With Pulse Throttling (90% duty cycle):
██████░██████░██████░██████░██████░██████░██████░██████░
  Load → pause → load → pause → 92°C → 20 t/s (sustained)

The ░ symbols represent micro-pauses where the VRAM cools. Peak throughput drops by roughly 10% (you lose the pause time), but sustained performance stays close to 20 t/s instead of crashing to 5 t/s after 15 minutes.
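
The general idea of a software duty cycle can be sketched in a few lines. This is not VRAM Shield's implementation, which isn't shown here. It is a generic pacing wrapper, and the `paced` name, the 100 ms window, and the injectable `sleep`/`clock` parameters are all assumptions for illustration. It only relieves the GPU if the token source is lazy, meaning the next token isn't computed until the consumer asks for it.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    tokens: Iterable[T],
    duty_cycle: float = 0.9,    # fraction of each window spent working
    window_s: float = 0.1,      # assumed window length in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield tokens, pausing once the work share of each window is used up."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    work_s = window_s * duty_cycle
    pause_s = window_s - work_s
    window_start = clock()
    for token in tokens:
        yield token
        # Once this window's work budget is spent, rest, then start a new window.
        if pause_s > 0 and clock() - window_start >= work_s:
            sleep(pause_s)
            window_start = clock()
```

A caller would wrap its streaming loop, for example `for tok in paced(stream, duty_cycle=0.9): ...`, where `stream` stands for whatever lazy token iterator the inference client exposes.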

For multi-hour inference sessions, Smart Throttling (Pro) adjusts the duty cycle dynamically based on thermal trends. If the memory junction temperature is rising rapidly, it increases pause frequency preemptively. If it's stable, it reduces pauses to maximize throughput.
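
A trend-based adjustment of that kind might look like the sketch below. It is not the product's algorithm: the linear projection, the 60-second horizon, the 5% step and the 50% floor are all assumed values chosen to illustrate the idea.

```python
from typing import Sequence


def next_duty_cycle(
    temps_c: Sequence[float],   # recent VRAM readings, oldest first
    current: float,             # current duty cycle, 0..1
    limit_c: float = 105.0,     # firmware threshold quoted in the article
    interval_s: float = 2.0,    # seconds between readings
    horizon_s: float = 60.0,    # assumed look-ahead window
    step: float = 0.05,         # assumed adjustment per decision
    floor: float = 0.5,         # assumed minimum duty cycle
) -> float:
    """Lower the duty cycle if the trend projects past the limit, else raise it."""
    if len(temps_c) < 2:
        return current
    # Average rate of change across the window, in °C per second.
    slope = (temps_c[-1] - temps_c[0]) / ((len(temps_c) - 1) * interval_s)
    projected = temps_c[-1] + slope * horizon_s
    if projected >= limit_c:
        return max(floor, current - step)   # more pauses
    return min(1.0, current + step)         # fewer pauses


print(round(next_duty_cycle([92, 94, 96, 98], current=0.9), 2))
print(round(next_duty_cycle([90, 90, 90, 90], current=0.9), 2))
```

The first call sees a rising trend and backs off; the second sees a flat trend and gives throughput back. A real controller would likely smooth the readings and add hysteresis so the duty cycle doesn't oscillate.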

The key insight: thermal management for LLM inference isn't about cooling the hardware better. It's about controlling the thermal load at the source. Reduce the sustained read/write operations on VRAM to a level that the existing cooling system can handle. The hardware is capable; it just needs software-defined duty cycles to stay within thermal limits.

Summary

The stability problem in local LLM inference has a specific, measurable cause: VRAM thermal saturation during sustained memory bus operations. The -cmoe flag in llama.cpp solves the memory capacity problem by splitting MoE expert weights across VRAM and system RAM. But it creates a thermal problem because the sustained read/write operations on VRAM generate heat faster than standard laptop cooling can dissipate it.

The telemetry gap compounds the issue. Task Manager shows GPU core temperature (75°C) but hides memory junction temperature (105°C). Without direct NVML access, you're monitoring the wrong sensor and making decisions based on incomplete data.

The fix is straightforward:

  1. Query NVML directly using Python ctypes or pynvml to read the memory junction temperature
  2. Set thermal thresholds at 95°C (warning) and 105°C (critical)
  3. Implement software-defined duty cycles to control the sustained VRAM load

For production deployments, VRAM Shield provides the thermal management layer that standard OS tools lack. Its Pulse Throttling technology maintains 20 t/s sustained token generation by introducing millisecond-level pauses that keep the memory junction below the firmware threshold.

The memory bus is your real thermal bottleneck. Monitor it directly. Manage it deliberately. Your inference nodes will stay stable.

Get started

Star the VRAM Shield repository on GitHub. Download the portable utility from vramshield.com or the releases page. Integrate the NVML monitoring script into your deployment pipeline. Build inference nodes that don't degrade over time.

The tools exist. The telemetry exists. Use them.
