plasmon

If Memory Could Compute, Would We Still Need GPUs?

The bottleneck for LLM inference isn't GPU compute. It's memory bandwidth.

A February 2026 arXiv paper (arXiv:2601.05047) states it plainly: the primary challenges for LLM inference are memory and interconnect, not computation. GPU arithmetic units spend more than half their time idle, waiting for data to arrive.

So flip the paradigm. Compute where the data lives, and data movement disappears. This is the core idea behind Processing-in-Memory (PIM). SK Hynix's AiM is shipping as a commercial product. Samsung announced LPDDR5X-PIM in February 2026. HBM4 integrates logic dies, turning the memory stack itself into a co-processor.

Is the GPU era ending? Short answer: no. But PIM will change LLM inference architecture. How far the change goes, and where it stops — that's what the papers and product data reveal.


The Memory Wall: Why GPUs Sit Idle

LLM inference has two phases with different bottlenecks:

Prefill phase (prompt processing):
  Batch-processes many tokens at once
  Matrix-matrix multiply → Compute-bound
  GPU arithmetic units fully utilized

Decode phase (token generation):
  Generates one token at a time, autoregressively
  KV cache reads → Memory bandwidth-bound
  GPU arithmetic units idle, waiting for data

The problem is the Decode phase. Most LLM inference time is spent in Decode. And Decode is memory bandwidth-limited.

# Memory bandwidth bottleneck on RTX 4060 8GB
rtx_4060_specs = {
    "compute": "15.11 TFLOPS (FP16)",
    "memory_bandwidth": "272 GB/s",
    "required_arithmetic_intensity": "15110 / 272 = 55.6 FLOP/byte",
}

# Actual arithmetic intensity during LLM Decode
llm_decode = {
    "typical_arithmetic_intensity": "1-2 FLOP/byte",
    "bottleneck": "memory bandwidth (272 GB/s wall)",
    "gpu_utilization": "< 5% of compute capacity during decode",
}

# 95%+ of GPU compute sits idle during Decode

The A100 80GB has the same structure: 2 TB/s of HBM2e bandwidth against 312 TFLOPS (FP16) of compute, for a required arithmetic intensity of 156 FLOP/byte. Decode's 1-2 FLOP/byte falls short by roughly 80-150x.
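The same ridge-point arithmetic applies to any GPU. A minimal sketch using the specs quoted above (function names are mine, not from any library):

```python
def ridge_point(tflops: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) needed to saturate compute."""
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)

def compute_utilization(actual_intensity: float, ridge: float) -> float:
    """Fraction of peak FLOPS achievable when memory bandwidth-bound."""
    return min(1.0, actual_intensity / ridge)

rtx_4060_ridge = ridge_point(15.11, 272)   # ~55.6 FLOP/byte
a100_ridge     = ridge_point(312, 2000)    # ~156 FLOP/byte

decode_intensity = 1.5  # typical GEMV intensity during Decode
print(f"RTX 4060 decode utilization: {compute_utilization(decode_intensity, rtx_4060_ridge):.1%}")
print(f"A100 decode utilization:     {compute_utilization(decode_intensity, a100_ridge):.1%}")
```

Running this gives utilization in the low single-digit percent range for both cards, which is where the "95%+ of compute sits idle" figure comes from.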


PIM Principle: Don't Move Data, Move Computation

Traditional architecture:

DRAM/HBM → bus → GPU compute → bus → DRAM/HBM
  Data travels round-trip. Bus bandwidth is the bottleneck.

PIM architecture:

Compute units inside DRAM/HBM → only results output
  Data doesn't move. Computation moves to the data.
  Internal bandwidth is orders of magnitude higher than bus bandwidth.

HBM's internal bandwidth (aggregate across banks) is tens of times higher than external bandwidth. Computing inside HBM without moving data out eliminates the bandwidth wall.
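A back-of-the-envelope consequence of being bandwidth-bound: for a dense model, every decoded token must stream all weights through memory once, so bandwidth alone caps tokens per second. A sketch under that simplification (ignoring KV cache reads and any compute time):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on dense-model decode throughput:
    each token reads every weight once, so rate <= bandwidth / model size."""
    return bandwidth_gb_s / model_gb

# A 7B-parameter model at 4-bit quantization is roughly 3.5 GB of weights
print(f"RTX 4060 ceiling: {max_tokens_per_sec(272, 3.5):.0f} tokens/s")
print(f"A100 ceiling:     {max_tokens_per_sec(2000, 3.5):.0f} tokens/s")
```

This is why PIM's order-of-magnitude-higher internal bandwidth translates directly into decode throughput: the numerator grows while nothing else changes.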

Shipping Products

SK Hynix AiM (Accelerator in Memory):
  - Commercial PIM processor based on GDDR6
  - Compute units per memory bank (AiMX card shipping)
  - Deployed in production environments
  - Specialized for GEMV (matrix-vector multiply)

Samsung LPDDR5X-PIM (announced Feb 2026):
  - In-memory compute in mobile LPDDR5X
  - Significant energy efficiency improvement for edge AI inference (industry estimates: several-fold)
  - Targeting smartphones and edge devices

Samsung/SK Hynix HBM4 plans:
  - Logic die integrated into HBM stack
  - Memory stack becomes a co-processor
  - Mass production from February 2026
  - Targeting NVIDIA's "Rubin" architecture

How PIM Changes LLM Inference

2025-2026 ArXiv papers propose concrete PIM × LLM architectures.

HPIM: Heterogeneous PIM Integration (arXiv:2509.12993)

HPIM (Heterogeneous PIM) Architecture:

SRAM-PIM (low latency):
  - Attention score computation
  - Small but ultra-fast
  - Equivalent to GPU L2 cache position

HBM-PIM (high bandwidth, large capacity):
  - KV cache storage and processing
  - Large capacity, medium speed
  - Equivalent to main memory position

Parallel execution of both:
  SRAM-PIM: attention score ← low latency
  HBM-PIM: KV multiplication ← high bandwidth
  → Parallelizes serial dependencies in autoregressive Decode

This envisions PIM across the entire memory hierarchy. Both cache and main memory contribute computation, each leveraging its strengths.
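The payoff of the split can be estimated with a toy latency model: if the SRAM-PIM and HBM-PIM steps can be overlapped (e.g. pipelined across tokens), per-token latency approaches the max of the two stages rather than their sum. The latency numbers below are illustrative, not from the paper:

```python
def serial_latency(attn_ms: float, kv_ms: float) -> float:
    """No overlap: attention score then KV multiply, back to back."""
    return attn_ms + kv_ms

def overlapped_latency(attn_ms: float, kv_ms: float) -> float:
    """HPIM-style ideal overlap: attention scores on SRAM-PIM
    while the KV multiply runs on HBM-PIM."""
    return max(attn_ms, kv_ms)

attn, kv = 0.2, 0.5  # illustrative per-token milliseconds
print(serial_latency(attn, kv), overlapped_latency(attn, kv))
```

In this toy case the overlapped pipeline hides the faster stage entirely, leaving the HBM-PIM step as the only critical path.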

PAM: Processing Across Memory Hierarchy (arXiv:2602.11521)

PAM (Processing Across Memory):

HBM-PIM:  Hot data (frequent access)
DRAM-PIM: Warm data (moderate access)
SSD-PIM:  Cold data (rare access)

→ Optimize processing location by data temperature
→ Handle long-context LLMs (100K+ tokens)

When the entire model doesn't fit in HBM (an everyday reality for us RTX 4060 8GB users), cross-hierarchy PIM could become an alternative to partial offloading.
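The tier names below come from the paper; the thresholds and routing function are made up for illustration, since PAM's actual placement policy isn't that simple:

```python
# Sketch of PAM-style data-temperature routing for KV cache pages.
def assign_tier(accesses_per_sec: float) -> str:
    """Route a KV cache page to a memory tier by access frequency."""
    if accesses_per_sec > 100:
        return "HBM-PIM"    # hot: recent context, attended every token
    elif accesses_per_sec > 1:
        return "DRAM-PIM"   # warm: mid-context pages
    return "SSD-PIM"        # cold: distant context, rarely attended

kv_pages = {"recent": 500.0, "mid_context": 10.0, "distant": 0.1}
placement = {page: assign_tier(rate) for page, rate in kv_pages.items()}
print(placement)
```

The key difference from ordinary offloading: cold pages aren't just parked on slower media, they're *processed* there, so attending to distant context doesn't require hauling it back up the hierarchy.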


Three Reasons PIM Can't Kill GPUs

PIM is compelling, but it won't make GPUs unnecessary.

1. Training Is Off the Table

LLM training is compute-bound. Large matrix multiplications, gradient computation, parameter updates — these need a GPU's high compute density. PIM's in-memory units are good for matrix-vector products (GEMV) but are vastly outperformed by GPUs on matrix-matrix products (GEMM).

# Training vs Inference workload characteristics
workload_characteristics = {
    "training": {
        "dominant_op": "GEMM (matrix-matrix multiply)",
        "arithmetic_intensity": "high (100+ FLOP/byte)",
        "bottleneck": "compute",
        "pim_advantage": "none (GEMM is GPU territory)",
    },
    "inference_prefill": {
        "dominant_op": "GEMM (batched)",
        "arithmetic_intensity": "medium-high",
        "bottleneck": "compute (batch-size dependent)",
        "pim_advantage": "limited",
    },
    "inference_decode": {
        "dominant_op": "GEMV (matrix-vector multiply)",
        "arithmetic_intensity": "low (1-2 FLOP/byte)",
        "bottleneck": "memory bandwidth",
        "pim_advantage": "★ significant",
    },
}

PIM's window of advantage is inference Decode only. Training and Prefill are GPU's domain.
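The GEMM/GEMV gap in the table follows directly from FLOP-to-byte ratios. A quick derivation in code (FP16 elements, assuming ideal reuse for GEMM and counting only main-memory traffic):

```python
def gemv_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """y = A @ x with A of shape (n, n): 2*n*n FLOPs,
    dominated by streaming A from memory exactly once."""
    flops = 2 * n * n
    bytes_moved = bytes_per_elem * (n * n + 2 * n)  # A, x, y
    return flops / bytes_moved

def gemm_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """C = A @ B with n x n matrices: 2*n^3 FLOPs over
    three n^2 operands (ideal cache reuse)."""
    flops = 2 * n ** 3
    bytes_moved = bytes_per_elem * 3 * n * n
    return flops / bytes_moved

print(f"GEMV (n=4096): {gemv_intensity(4096):.2f} FLOP/byte")  # ~1
print(f"GEMM (n=4096): {gemm_intensity(4096):.0f} FLOP/byte")  # ~n/3
```

GEMV intensity is pinned near 1 FLOP/byte no matter how large the matrix gets, while GEMM intensity grows linearly with n — which is exactly why Decode lives below the ridge point and training lives above it.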

2. Immature Programming Model

Leveraging PIM requires explicit programming of data placement and compute mapping. No CUDA-equivalent exists.

GPU:
  Mature software stack (CUDA: nearly two decades of development)
  Full framework support (PyTorch, TensorFlow, llama.cpp)
  Massive developer ecosystem

PIM:
  Vendor-specific APIs (SK Hynix AiM SDK, Samsung PIM SDK)
  No framework integration
  Small developer community
  Manual memory placement optimization required

Hardware without software is useless. CUDA made GPUs a compute platform. PIM needs its "CUDA moment." It hasn't happened yet.

3. Cost Structure

PIM costs more to manufacture than standard memory. Added die area for compute units, more complex testing, lower yields.

Standard HBM3E: ~$10-18/GB (2026 market estimates)
HBM-PIM:        ~$20-30/GB (estimated, compute unit premium)
GPU (A100):     ~$10,000 (includes 80GB HBM2e)

PIM ROI conditions:
  Inference server power cost savings > PIM premium
  → May work at datacenter scale
  → Not relevant for individual users in the near term

Will PIM Reach RTX 4060 Users?

Honestly, consumer PIM is 3-5 years away.

Currently available PIM:
  - SK Hynix AiM: Datacenter only, not consumer-purchasable
  - Samsung LPDDR5X-PIM: Mobile only, not in PCs

Expected 2027-2028:
  - Possible PIM integration in HBM4-equipped GPUs
  - NVIDIA Rubin architecture adopts HBM4
  - Whether PIM features ship in consumer GPUs: unknown

Current practical solutions for consumers:
  1. Higher-bandwidth GPUs (RTX 5090: GDDR7 at ~1.8 TB/s)
  2. MoE models to reduce active parameters (lower bandwidth demand)
  3. Speculative decoding to improve effective bandwidth utilization
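Of those three, speculative decoding is the least obvious, so a sketch of why it helps: a small draft model proposes k tokens and the big model verifies them all in one bandwidth-bound pass, so accepted drafts amortize the weight reads. The expected tokens per verification pass follows the standard formula from the speculative decoding literature; the acceptance rate below is an assumed value, and draft-model cost is ignored:

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target-model pass when the draft
    model proposes k tokens, each independently accepted with
    probability accept_rate: (1 - a^(k+1)) / (1 - a)."""
    a = accept_rate
    return (1 - a ** (k + 1)) / (1 - a)

# Assumed 70% acceptance rate, 4 drafted tokens per pass:
# each bandwidth-bound big-model pass yields ~2.8 tokens instead of 1.
print(f"{expected_tokens_per_pass(0.7, 4):.2f}")
```

Since each big-model pass costs roughly the same memory traffic whether it verifies one token or five, that ~2.8x directly improves effective bandwidth utilization.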

Indirect benefits exist, though. As datacenter PIM adoption grows, cloud API inference costs should fall: energy-efficiency gains translate into lower per-token pricing.


Where PIM Redraws the Line

PIM won't kill GPUs. But it redraws the boundary of GPU work.

Current boundary:
  GPU = training + inference (everything)
  Memory = data storage only

Post-PIM boundary:
  GPU = training + Prefill (compute-bound)
  PIM = Decode (memory-bound)
  Memory = data storage + Decode computation

When Decode shifts to PIM, GPUs can specialize in Prefill and training. The idle-GPU problem during Decode disappears.

This also shifts semiconductor industry dynamics. Part of the inference market that NVIDIA monopolizes today moves to memory makers (Samsung, SK Hynix, Micron). TrendForce reported in March 2026 that Samsung and SK Hynix are exploring "next-gen AI memory that could challenge NVIDIA."


The Memory Wall Is Crumbling — From the Inside

Back to the opening question: "If memory could compute, would we still need GPUs?"

Yes, we need GPUs. Training and Prefill are non-negotiable. But Decode's lead actor may shift from GPU to PIM.

  • PIM fundamentally solves the memory bandwidth bottleneck for inference Decode
  • SK Hynix AiM is shipping, Samsung LPDDR5X-PIM is announced, HBM4 integrates logic dies
  • But: can't handle training, software stack is immature, carries a cost premium
  • Consumer PIM is 3-5 years out. MoE + Speculative Decoding remain the practical solution for now

The memory wall isn't being broken from the outside (more bandwidth). It's crumbling from the inside (putting computation in). That wave will take a while to reach individual GPUs, but in datacenters, it's already underway.


References

  1. "Challenges and Research Directions for Large Language Model Inference Hardware" (2026) arXiv:2601.05047
  2. "HPIM: Heterogeneous Processing-In-Memory-based Accelerator for LLM Inference" (2025) arXiv:2509.12993
  3. "PAM: Processing Across Memory Hierarchy" (2026) arXiv:2602.11521
  4. "Memory Is All You Need: Compute-in-Memory Architectures for LLM Inference" (2024) arXiv:2406.08413
  5. TrendForce. "Beyond HBM: Samsung, SK hynix Explore Next-Gen AI Memory" (2026-03-10)
  6. Samsung. "LPDDR5X-PIM for AI Computing" (2026-02)
