DEV Community: vaibhav ahluwalia

Caching Strategies for LLM Systems – Part 4: Grouped-Query Attention for Scalable, Efficient Transformers

vaibhav ahluwalia — Sat, 21 Feb 2026 14:46:18 +0000

"Scaling Large Language Models is no longer about adding more GPUs — it's about designing attention mechanisms that think smarter, not just bigger. Grouped-Query Attention (GQA) is the mathematically elegant solution that balances memory efficiency and expressive power, unlocking practical inference for billion-parameter models."

The Scaling Challenge

Transformer attention is simple to describe but expensive to execute at scale. For sequence length $L$ and embedding dimension $d$ , a transformer with $H$ attention heads maintains:

Q_i, K_i, V_i \in \mathbb{R}^{L \times d_h}, \quad d_h = \frac{d}{H}, \quad i = 1, \dots, H

The KV cache memory grows as:

M_\text{KV}^{\text{MHA}} = H \cdot L \cdot d_h

For large models ( $H = 64$ , $L = 16 K$ , $d_h=128$ ), keys and values together take roughly 0.5 GB per layer in FP16, which adds up to tens of GBs across a deep stack of layers and dominates inference memory and bandwidth.

Multi-Query Attention (MQA) simplifies this:

\text{Attention}_i = \text{softmax}\left(\frac{Q_i K^T}{\sqrt{d_h}}\right) V

Memory-efficient — KV size reduced by factor $H$

Reduced expressiveness — all heads share the same keys/values

The Elegant Middle Ground: Grouped-Query Attention

GQA partitions the $H$ heads into $G$ groups, each sharing one KV pair:

\text{Attention}_i = \text{softmax}\left(\frac{Q_i K_{g(i)}^T}{\sqrt{d_h}}\right) V_{g(i)}, \quad g(i) \in \{1, \dots, G\}

Where $1 < G < H$

Memory reduction:

M_\text{KV}^{\text{GQA}} = G \cdot L \cdot d_h = \frac{G}{H} M_\text{KV}^{\text{MHA}}

Extreme cases:

$G = 1$ → MQA
$G = H$ → MHA

Example: $H = 16$ , $G = 4$ → each group of 4 heads shares one KV → 4× smaller KV memory than MHA, while keeping 4 distinct key/value projections instead of MQA's single one.
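To make the $G / H$ scaling concrete, here is a minimal sketch that counts KV-cache elements per layer for MHA, GQA, and MQA. The function name `kv_cache_elements` and the sequence length and head dimension are illustrative assumptions, not values from any specific model; the factor of 2 accounts for storing both keys and values.

```python
def kv_cache_elements(num_kv_heads: int, seq_len: int, head_dim: int) -> int:
    """Elements stored per layer: keys plus values for every KV head."""
    return 2 * num_kv_heads * seq_len * head_dim

H, G = 16, 4                     # query heads and KV groups from the example
seq_len, head_dim = 4096, 128    # illustrative values only

mha = kv_cache_elements(H, seq_len, head_dim)   # G = H
gqa = kv_cache_elements(G, seq_len, head_dim)   # 1 < G < H
mqa = kv_cache_elements(1, seq_len, head_dim)   # G = 1

print(mha // gqa)  # 4  -> GQA cache is H / G times smaller than MHA
print(mha // mqa)  # 16 -> MQA cache is H times smaller than MHA
```

Only the number of KV heads changes between the three calls, which is why the savings depend on $H / G$ and not on sequence length or head dimension.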

Intuition: Visualizing GQA

Consider a model with 16 attention heads divided into 4 groups:

Heads:  1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16
KV:     KV1      KV2         KV3           KV4

Key insights:

Each group attends independently
Memory scales with number of groups ( $G$ ), not total heads ( $H$ )
Expressiveness scales with the number of groups ( $G$ ); a larger group size ( $H / G$ ) means more sharing

GQA allows us to interpolate between full MHA (max diversity, max memory) and MQA (minimal memory, minimal diversity).
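The head-to-group mapping can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production kernel: it has no causal mask, no batching, and it materializes the repeated keys and values for clarity, whereas the cache itself only ever needs the $G$ stored KV tensors. The function name `grouped_query_attention` and the tensor sizes are hypothetical.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (H, L, d_h); k, v: (G, L, d_h) with H divisible by G. Returns (H, L, d_h)."""
    H, L, d_h = q.shape
    G = k.shape[0]
    assert H % G == 0, "number of heads must be divisible by number of groups"
    group_size = H // G
    # Head i uses KV group g(i) = i // group_size (0-based indexing).
    k_full = np.repeat(k, group_size, axis=0)               # (H, L, d_h)
    v_full = np.repeat(v, group_size, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_h)   # (H, L, L)
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_full

rng = np.random.default_rng(0)
H, G, L, d_h = 16, 4, 8, 32
q = rng.standard_normal((H, L, d_h))
k = rng.standard_normal((G, L, d_h))   # only G key tensors are stored
v = rng.standard_normal((G, L, d_h))   # only G value tensors are stored
out = grouped_query_attention(q, k, v)
print(out.shape)  # (16, 8, 32)
```

Setting `G = 1` in this sketch gives MQA and `G = H` gives standard MHA, which matches the two extreme cases listed earlier.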

Quantitative Trade-Offs

Memory Trade-Off: $M_\text{KV}^{\text{GQA}} / M_\text{KV}^{\text{MHA}} = G/H$

Smaller values mean larger memory savings.

Expressiveness: Approximately proportional to number of groups ( $G$ )

More groups = more inter-head diversity.

Inference Speed: Near-MQA for $G \ll H$ , near-MHA for $G \approx H$

Trade-off between speed and model quality.

H	G	KV Memory Reduction	Expressiveness
16	1	16×	Low (MQA)
16	4	4×	Medium-High (GQA)
16	16	1×	Max (MHA)

Real-World Applications

Meta LLaMA 2 (70B): GQA reduces KV memory to fit large contexts efficiently
Mistral 7B: Improves inference throughput on GPUs without sacrificing accuracy
Other autoregressive LLMs: Any model with large head counts benefits from GQA

Insight: The larger the head count and sequence length, the more impactful GQA becomes, with KV memory savings scaling with $H / G$ .

Research Takeaways

Memory-Efficient Scaling: GQA allows multi-billion parameter models to run within practical hardware limits.
Mathematical Trade-Off Framework: $G$ is a tunable parameter controlling memory vs. expressiveness — a quantifiable design principle.
Pretrained Model Adaptation: MHA → GQA conversion via grouped averaging of KV weights, followed by brief fine-tuning.
Efficiency-Aware Architecture: Future LLM design should consider GQA-like mechanisms to optimize bandwidth, memory, and cost.
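The grouped averaging mentioned under Pretrained Model Adaptation can be sketched as mean-pooling the per-head key (or value) projection weights within each group. This is a rough illustration only: it assumes heads are laid out contiguously along the output axis of the weight matrix, which may not hold for a given checkpoint format, and it omits the fine-tuning step that follows. The name `mean_pool_kv_heads` is hypothetical.

```python
import numpy as np

def mean_pool_kv_heads(w: np.ndarray, num_heads: int, num_groups: int) -> np.ndarray:
    """Average per-head K (or V) projection weights within each group.

    w: (d_model, num_heads * d_h), heads contiguous along the last axis (assumed).
    Returns: (d_model, num_groups * d_h).
    """
    d_model, total = w.shape
    assert total % num_heads == 0 and num_heads % num_groups == 0
    d_h = total // num_heads
    group_size = num_heads // num_groups
    per_head = w.reshape(d_model, num_groups, group_size, d_h)
    return per_head.mean(axis=2).reshape(d_model, num_groups * d_h)

rng = np.random.default_rng(0)
d_model, H, G, d_h = 64, 8, 2, 8          # illustrative sizes
w_k_mha = rng.standard_normal((d_model, H * d_h))
w_k_gqa = mean_pool_kv_heads(w_k_mha, H, G)
print(w_k_gqa.shape)  # (64, 16)
```

The same function would be applied to the value projection; query and output projections are left unchanged.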

"The next frontier of AI isn't just bigger models it's smarter, efficiency-first architectures. Grouped-Query Attention exemplifies this approach: mathematically principled, practical for real-world deployment, and critical for scaling intelligent systems without hitting memory walls. The future belongs to those who design with both compute and cognition in mind."

Caching Strategies for LLM Systems (Part 3): Multi-Query Attention and Memory-Efficient Decoding

vaibhav ahluwalia — Sun, 08 Feb 2026 15:51:07 +0000

In Part 2, we saw how KV caching transforms autoregressive decoding by eliminating redundant attention computation. By storing keys and values from previous tokens, transformers reduce per-token compute from quadratic to linear in sequence length.

However, KV caching introduces a new bottleneck.

As models scale, KV cache memory becomes the dominant cost of inference, often exceeding model weights for long contexts. This post examines Multi-Query Attention (MQA)—an architectural modification that directly attacks this memory bottleneck by changing how attention heads share representation.

The Scaling Problem: KV Cache Grows with Head Count

In standard Multi-Head Attention (MHA), each head has its own key and value projections.

For a model with:

$L$ transformer layers
$H$ attention heads
sequence length $T$
head dimension $d_h$

the KV cache memory scales as:

\mathcal{O}(L \cdot H \cdot T \cdot d_h)

KV caching removes redundant computation, but does nothing to reduce memory growth with respect to the number of heads.

For modern LLMs with 32–128 heads and long context windows, KV cache memory and bandwidth quickly become the limiting factor in inference throughput.

This leads to a fundamental question:

Do attention heads really need independent keys and values?

Multi-Query Attention: Core Architectural Change

Multi-Query Attention (MQA) answers this by imposing a strong but deliberate constraint:

All attention heads have independent queries, but share a single set of keys and values.

Formally:

Q_i = X W_{Q_i}, \quad K = X W_K, \quad V = X W_V

Each head computes:

\text{Attention}_i = \text{softmax}\left( \frac{Q_i K^\top}{\sqrt{d_h}} \right) V

Important clarifications

Keys and values are shared across heads
Keys are not equal to values
$W_K \neq W_V$ : they remain distinct projections

This single design decision collapses the KV cache size by a factor of $H$ .

Weight Matrix Geometry

Let the model dimension be $d$ .

Multi-Head Attention (MHA)

$W_Q \in \mathbb{R}^{d \times (H d_h)}$
$W_K \in \mathbb{R}^{d \times (H d_h)}$
$W_V \in \mathbb{R}^{d \times (H d_h)}$

Multi-Query Attention (MQA)

$W_Q \in \mathbb{R}^{d \times (H d_h)}$
$W_K \in \mathbb{R}^{d \times d_h}$
$W_V \in \mathbb{R}^{d \times d_h}$
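The MQA shapes above can be checked with a small NumPy sketch. Sizes and random weights are illustrative assumptions, and the example omits the causal mask and output projection to keep the focus on how one shared $K$ and $V$ serve all $H$ query heads.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H, d_h = 6, 64, 8, 8     # illustrative sizes with d = H * d_h

X = rng.standard_normal((T, d))
W_Q = rng.standard_normal((d, H * d_h))   # per-head query projections
W_K = rng.standard_normal((d, d_h))       # single shared key projection
W_V = rng.standard_normal((d, d_h))       # single shared value projection

Q = (X @ W_Q).reshape(T, H, d_h).transpose(1, 0, 2)   # (H, T, d_h)
K = X @ W_K                                           # (T, d_h), shared by all heads
V = X @ W_V                                           # (T, d_h), shared by all heads

scores = Q @ K.T / np.sqrt(d_h)                       # (H, T, T) via broadcasting
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                                     # (H, T, d_h)

print(K.shape, out.shape)  # (6, 8) (8, 6, 8)
```

Only `K` and `V` would be cached during decoding, and neither has a head axis, which is where the factor-of-$H$ saving comes from.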

KV Cache Memory Comparison

Attention Type	KV Cache per Layer
Multi-Head Attention	$H \times T \times d_h$
Multi-Query Attention	$1 \times T \times d_h$

For a 32-head model, MQA yields a 32× reduction in KV cache memory and memory bandwidth during decoding.

KV Cache Memory: MHA vs MQA (Illustrative Example)

Assume:

Layers (L): 80
Attention heads (H): 64
Head dimension (dₕ): 128
Context length (T): 2048
Precision: FP16 (2 bytes per element)

Attention Type	KV Cache Formula	KV Cache per Sequence
Multi-Head Attention (MHA)	`2 × L × H × T × dₕ × 2 bytes`	~5.4 GB (5 GiB)
Multi-Query Attention (MQA)	`2 × L × 1 × T × dₕ × 2 bytes`	~84 MB (80 MiB)
Reduction	—	~64× smaller

2 × accounts for storing both Keys and Values.

What MQA Actually Changes (Research View)

A common explanation claims:

“Most attention diversity comes from queries.”

This is incomplete and misleading.

The real story is about inductive bias and representational collapse.

Expressiveness in Multi-Head Attention

In MHA, each head has independent projections:

Q_i = X W_{Q_i}, \quad K_i = X W_{K_i}, \quad V_i = X W_{V_i}

This allows each head to learn a distinct attention subspace:

Different similarity metrics via $K_i$
Different retrieval semantics via $V_i$
Different alignment objectives via $Q_i$

From a geometric perspective, MHA spans multiple low-rank attention operators, enabling the model to represent competing relational views of the same sequence.

This is what enables heads to specialize in syntax, long-range dependency, positional bias, or coreference.

What MQA Removes

MQA enforces:

K_1 = \dots = K_H = K, \quad V_1 = \dots = V_H = V

As a result:

All heads score relevance in the same key space
All heads retrieve from the same value manifold
Head diversity exists only through queries

This collapses the attention operator from H independent subspaces into a single shared memory with multiple query routers.

The True Inductive Bias of MQA

MQA assumes:

A single shared representation of context is sufficient,
and attention diversity mainly arises from routing, not representation.

This is a non-trivial constraint on the hypothesis space.

It reduces the rank and diversity of attention mappings, limiting the model’s ability to represent multiple incompatible interpretations simultaneously.

Where Expressiveness Is Lost

Compared to MHA, MQA loses:

Per-head similarity metrics
Per-head semantic abstractions
Independent relational subspaces

This directly impacts the model’s ability to:

View the same token from different semantic angles
Encode orthogonal linguistic features in parallel
Maintain head-level specialization

In short:

MQA reduces the model’s “point-of-view capacity.”

This follows directly from the reduced rank and shared representation imposed by MQA.

Why MQA Still Works at Scale

Despite this loss, large models trained with MQA often show minimal degradation because:

Redundancy in MHA heads: Many attention heads learn correlated or weakly distinct patterns.
Compensation by depth and width: Feed-forward layers absorb representational burden.
Training adapts to the constraint: Models trained from scratch with MQA learn robust shared KV spaces.
Inference dominates deployment cost: Memory bandwidth, not expressiveness, becomes the bottleneck.

This explains why PaLM and inference-optimized LLMs adopt MQA successfully.

Autoregressive Inference Implications

During decoding:

1. Queries are recomputed per token
2. Keys and values are loaded from cache
3. Attention is computed

With MHA, step (2) loads $H$ KV tensors per layer.
With MQA, only one KV tensor is loaded.

This dramatically reduces:

Memory traffic
Cache pressure
Token latency

Summary: Compute vs Representation Trade-off

Aspect	MHA	MQA
Attention subspaces	Many	One
KV diversity	Per-head	Shared
Expressiveness	Higher	Lower
KV cache size	$\mathcal{O}(H T d_h)$	$\mathcal{O}(T d_h)$
Inference efficiency	Lower	Much higher

MQA is not a free optimization.
It is a deliberate architectural trade-off favoring inference scalability over maximal expressiveness.

Connect with me:

LinkedIn: (https://www.linkedin.com/in/vaibhav-ahluwalia-83887a227/)

Caching Strategies for LLM Systems (Part 2): KV Cache and the Mathematics of Fast Transformer Inference

vaibhav ahluwalia — Mon, 19 Jan 2026 17:47:48 +0000

Diagram of self‑attention in transformers: inputs are transformed into Q (queries), K (keys), and V (values), and attention weights are computed using scaled dot products.

(Source: ResearchGate)

Autoregressive decoding in transformers is computationally expensive due to repeated self-attention over an ever-growing context. A naïve implementation recomputes attention over all previous tokens at every decoding step, leading to quadratic complexity per token. Transformer Key-Value (KV) caching eliminates this redundancy by storing attention keys and values from previous steps and reusing them during decoding.

If you missed it, in Part 1 of this series, we explored Exact-Match and Semantic Caching for LLMs, showing how we can reuse previous responses to reduce latency and token cost. In this post, we move to KV caching, a complementary technique that optimizes autoregressive decoding itself.

The Autoregressive Problem

Language models generate text one token at a time.

Mathematically, a model learns:

p(x₁, x₂, …, x_T) = ∏ₜ p(x_t | x₁, …, x_{t-1})

Which means:

Each new token depends on all previous tokens.

So when generating token x_t, the model must look at the entire prefix:

x₁, x₂, …, x_{t-1}

This is where attention comes in.

The Attention Equation (Core of Transformers)

Inside every transformer layer, attention is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Where:

Q = Query vectors
K = Key vectors
V = Value vectors
d = dimension of each attention head

For a sequence of length t, we compute:

Q ∈ R^(t × d)
K ∈ R^(t × d)
V ∈ R^(t × d)

So attention builds a t × t similarity matrix.

The Hidden Inefficiency

During generation, tokens arrive one by one:

x₁ → x₂ → x₃ → ... → x_t

At step t, the model computes attention over:

x₁, x₂, ..., x_t

But notice something important:

The Key and Value vectors of previous tokens never change.

Once token x₁ is processed:

its Key = fixed
its Value = fixed

Yet naïve decoding recomputes them again and again.

So at step t, the model recomputes:

K₁, K₂, ..., K_{t-1}
V₁, V₂, ..., V_{t-1}

even though they were already computed at step t-1.

This is pure wasted compute.

Complexity Explosion

Let’s look at the cost.

At step t, attention costs:

O(t²)

So total decoding cost becomes:

Σₜ O(t²) = O(T³)

Which means:

Latency explodes with context length
Long prompts become impossible
Real-time chat breaks

This is why early transformers were painfully slow at inference.

The Key Insight: Keys and Values Are Invariant

Here is the beautiful observation:

Once a token is processed, its Key and Value never change.

So instead of recomputing them, we store them.

This is Key-Value caching.

KV Cache: The Mathematical Fix

At each step t, we do:

Compute Key and Value for the new token:

   k_t, v_t

Append them to cache:

   K₁:t = [K₁:t-1, k_t]
   V₁:t = [V₁:t-1, v_t]

Compute attention using cached keys and values:

   Attention(q_t, K₁:t, V₁:t)

Now we only compute attention between:

1 query vector
t cached key vectors

So the attention cost per step becomes linear in t.
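The three steps above can be sketched for a single attention head in NumPy. This is a toy illustration under stated assumptions: random weights, one layer, one head, and `X` standing in for the token hidden states. It checks that decoding with a cache reproduces the rows of full causal attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d = 5, 16, 8                      # illustrative sizes
W_q, W_k, W_v = (rng.standard_normal((d_model, d)) for _ in range(3))
X = rng.standard_normal((T, d_model))         # stand-in for token hidden states

# Decode with a KV cache: one new key/value row is computed per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for t in range(T):
    x_t = X[t]
    q_t, k_t, v_t = x_t @ W_q, x_t @ W_k, x_t @ W_v
    K_cache = np.vstack([K_cache, k_t])       # K_{1:t} = [K_{1:t-1}, k_t]
    V_cache = np.vstack([V_cache, v_t])       # V_{1:t} = [V_{1:t-1}, v_t]
    weights = softmax(q_t @ K_cache.T / np.sqrt(d))   # one query vs. t cached keys
    cached_outputs.append(weights @ V_cache)

# Reference: causal attention recomputed over the whole sequence at once.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf     # mask out future positions
full_outputs = softmax(scores) @ V

print(np.allclose(np.array(cached_outputs), full_outputs))  # True
```

Growing the cache with `np.vstack` copies the array at every step and is used here only for readability; a practical implementation would more likely preallocate the cache buffers.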

Complexity Reduction

Without KV Cache

Per token cost:

O(t²)

Total decoding:

O(T³)

With KV Cache

Per token cost:

O(t)

Total decoding:

O(T²)

This is a massive reduction.

This is why modern LLMs can handle:

Long conversations
Streaming responses
Real-time chat
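The totals in the complexity comparison above follow from two standard sums. As a worked derivation for generating $T$ tokens, ignoring constant factors and the model dimension:

```latex
\sum_{t=1}^{T} t^2 = \frac{T(T+1)(2T+1)}{6} = \mathcal{O}(T^3),
\qquad
\sum_{t=1}^{T} t = \frac{T(T+1)}{2} = \mathcal{O}(T^2)
```

The ratio of the two sums is $(2T+1)/3$, so for $T = 1000$ the cached version does roughly 667× less attention work.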

Memory Tradeoff

KV cache trades compute for memory.

Each token stores:

its Key vector
its Value vector

So memory grows linearly with context length:

Memory ∝ T

This is why:

Long context models need large GPUs
Paged KV cache exists
KV quantization is popular

KV Cache Memory Analysis

KV caching trades extra memory for faster computation during autoregressive inference.
Each new token contributes one key and one value vector per attention head per layer.

Notation

Symbol	Meaning
`B`	Batch size (number of sequences processed together)
`N`	Sequence length (context + generated tokens)
`L`	Number of transformer layers
`H`	Number of attention heads per layer
`d_head`	Dimension of each attention head
`d_model`	Hidden size of the model, typically `d_model = H × d_head`
`b`	Bits per element (e.g., 16 for FP16, 4 for int4)

KV Cache Formula

For a single batch element, each token requires:

numbers_per_token = 2 × L × d_model

For a batch of B sequences of length N, the total number of values stored in the KV cache is:

total_numbers = B × N × 2 × L × d_model

Converting to bytes, for b bits per number:

KV_cache_bytes = B × N × 2 × L × d_model × (b / 8)

Example: 7B Model (FP16)

Consider a 7B-parameter model with:

L = 32 layers
d_model = 4096 hidden size
Sequence length N = 4096
Batch size B = 1
FP16 → b = 16 bits (2 bytes per float)

KV_cache_bytes = 1 × 4096 × 2 × 32 × 4096 × 2
               = 2,147,483,648 bytes
               ≈ 2 GB

This shows how KV cache grows linearly with layers L, sequence length N, and batch B — which can become a memory bottleneck for large models.

Optimization: 4-bit KV Cache Quantization

If we store keys and values in 4-bit precision instead of FP16:

KV_cache_4bit = 2 GB × (4 / 16) = 0.5 GB

This dramatically reduces memory, enabling:

Longer context windows
Larger batch sizes
GPU-friendly inference
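The FP16 and 4-bit figures can be reproduced by evaluating the KV cache formula directly. This is a small sketch with a hypothetical helper name `kv_cache_bytes`; it counts only the raw key and value elements and ignores any extra metadata (such as scales) that a real quantization scheme may store.

```python
def kv_cache_bytes(batch: int, seq_len: int, layers: int, d_model: int, bits: int) -> int:
    """KV_cache_bytes = B x N x 2 x L x d_model x (b / 8)."""
    return batch * seq_len * 2 * layers * d_model * bits // 8

GIB = 1024 ** 3

fp16 = kv_cache_bytes(batch=1, seq_len=4096, layers=32, d_model=4096, bits=16)
int4 = kv_cache_bytes(batch=1, seq_len=4096, layers=32, d_model=4096, bits=4)

print(fp16, fp16 / GIB)  # 2147483648 2.0
print(int4, int4 / GIB)  # 536870912 0.5
```

Changing `batch` or `seq_len` scales the result linearly, which makes the function handy for a quick capacity estimate before choosing a context length.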

Quick Reference Table

Model	Seq Len	Layers	Hidden	Precision	KV Memory
7B	4096	32	4096	FP16	2 GB
7B	4096	32	4096	4-bit	0.5 GB

Sources: Hugging Face Transformers documentation & blog (on KV caching and cache quantization), and NVIDIA Developer blog on LLM inference memory formulas.

Final Intuition

KV cache turns transformers from slow batch models into fast streaming systems.

It is the difference between waiting minutes for a response and seeing tokens appear instantly.


Caching Strategies for LLM Systems: Exact-Match & Semantic Caching

vaibhav ahluwalia — Sat, 17 Jan 2026 16:50:03 +0000

LLM calls are expensive in latency, tokens, and compute. Caching is one of the most effective levers to reduce cost and speed up responses. This post explains two foundational caching techniques you can implement today: Exact-match (key-value) caching and Semantic (embedding) caching. We cover how each works, typical implementations, pros/cons, and common pitfalls.

Why caching matters for LLM systems

Every LLM call carries three primary costs:

Network latency — round-trip time to the API or inference cluster.
Token cost — many APIs charge per input + output tokens.
Compute overhead — CPU/GPU time spent running the model.

In production applications many queries repeat (exactly or semantically). A cache allows the system to return prior results without re-running the model, producing immediate wins in latency, throughput, and cost.

Key benefits:

Lower response time for end users.
Reduced API bills and compute consumption.
Higher throughput and better user experience at scale.

A thoughtful caching layer is often one of the highest-ROI engineering efforts for LLM products.

Exact-match (Key-Value) caching

How it works

Exact-match caching stores an LLM response under a deterministic key derived from the prompt (and any contextual state). When the same key is seen again, the cache returns the stored response.

Input prompt → Normalization → Hash/key → Lookup in KV store → Return stored response

Implementation notes

Normalization (optional but recommended): trim whitespace, canonicalize newlines, remove ephemeral metadata, and ensure consistent parameter ordering.
Key generation: use a stable hashing function (SHA-256) over the normalized prompt plus any relevant metadata (system prompt, temperature, model name, conversation id, schema version).
Storage: simple in-memory dict for prototypes; Redis/KeyDB for production; or a persistent object store for large responses.
Validation: store metadata with the response — model version, temperature, timestamp, source prompt — so you can safely decide whether a cached result is still valid or should be invalidated.

Simple Python example (conceptual)

import hashlib
import json

def make_key(prompt: str, system_prompt: str = "", model: str = "gpt-x", schema_version: str = "v1") -> str:
    normalized = "\n".join(line.strip() for line in prompt.strip().splitlines())
    payload = json.dumps({
        "system": system_prompt,
        "prompt": normalized,
        "model": model,
        "schema": schema_version,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Example usage:
# key = make_key(user_prompt, system_prompt, model_name)
# if key in kv_store: return kv_store[key]
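Building on the key function, a get-or-compute wrapper with a TTL might look like the sketch below. The class `ExactMatchCache`, the in-memory dict, and the `fake_llm` stub are all hypothetical stand-ins for illustration; a production setup would more likely use Redis or a similar store, as noted above.

```python
import hashlib
import json
import time
from typing import Callable

class ExactMatchCache:
    """In-memory exact-match cache with a time-to-live (TTL) per entry."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(prompt: str, model: str) -> str:
        normalized = "\n".join(line.strip() for line in prompt.strip().splitlines())
        payload = json.dumps({"prompt": normalized, "model": model}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, prompt: str, model: str, compute: Callable[[str], str]) -> str:
        key = self.make_key(prompt, model)
        entry = self._store.get(key)
        if entry is not None and time.time() - entry["created_at"] < self.ttl_seconds:
            self.hits += 1
            return entry["response"]
        self.misses += 1
        response = compute(prompt)
        # Keep metadata next to the response so entries can be validated later.
        self._store[key] = {"response": response, "model": model, "created_at": time.time()}
        return response

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"echo: {prompt.strip()}"

cache = ExactMatchCache(ttl_seconds=60)
first = cache.get_or_compute("What is KV caching?  ", "gpt-x", fake_llm)
second = cache.get_or_compute("  What is KV caching?", "gpt-x", fake_llm)
print(first == second, cache.hits, cache.misses)  # True 1 1
```

The two prompts differ only in surrounding whitespace, so normalization maps them to the same key and the second call is served from the cache.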

When to use exact caching

Deterministic workflows (e.g., agent step outputs).
Repeated system prompts and templates.
Where correctness requires exact reuse (no hallucination risk from mismatched context).

Advantages: simple, deterministic, zero false-positive risk.

Limitations: low hit rate for free-form natural language; brittle to minor prompt changes.

Semantic caching

How it works

Semantic caching stores an embedding for each prompt together with the response. For a new prompt, compute its embedding, perform a nearest-neighbor search among cached vectors, and reuse the cached response if similarity exceeds a threshold.

Prompt → Embedding → Similarity search in vector store → If max_sim ≥ threshold → reuse response

Implementation notes

Embeddings: choose a consistent embedding model. Store normalized prompt text, the embedding vector, response, and metadata (model, generation parameters, timestamp, schema version).
Vector store: FAISS, Milvus, Pinecone, Weaviate, or Redis Vector are common options depending on scale and latency needs.
Similarity metric: cosine similarity is standard for text embeddings. Use the same metric in indexing and querying.
Thresholding: set a threshold that balances reuse vs. safety. Typical cosine thresholds vary by embedding model — tune on your dataset (often starting conservatively around 0.85–0.9).

Conceptual example (pseudo-Python)

# compute embedding for new prompt
q_vec = embed(prompt)

# nearest neighbor search -> returns (id, sim_score)
nearest_id, sim = vector_store.search(q_vec, k=1)

if sim >= SIM_THRESHOLD:
    response = cache_lookup(nearest_id)
else:
    response = call_llm(prompt)
    store_embedding_and_response(q_vec, prompt, response)
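For a version of this flow that runs without external services, the sketch below swaps in a toy bag-of-words vector for the embedding model and a brute-force scan for the vector store. Both are assumptions made purely so the example is self-contained; the names `embed`, `cosine`, and `SemanticCache` are hypothetical, and the threshold only makes sense for this toy embedding.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str, str]] = []  # (vector, prompt, response)

    def lookup(self, prompt: str):
        """Return (response, similarity) for the best match, or (None, best_sim)."""
        q_vec = embed(prompt)
        best_sim, best_response = 0.0, None
        for vec, _, response in self.entries:   # brute-force nearest neighbour
            sim = cosine(q_vec, vec)
            if sim > best_sim:
                best_sim, best_response = sim, response
        if best_sim >= self.threshold:
            return best_response, best_sim
        return None, best_sim

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), prompt, response))

cache = SemanticCache(threshold=0.8)   # chosen for the toy embedding only
cache.store("how do I reset my password", "Use the 'Forgot password' link.")

hit, sim_hit = cache.lookup("how do I reset my password please")
miss, sim_miss = cache.lookup("what is the weather today")
print(hit is not None, miss is None)  # True True
```

On a miss, the caller would invoke the model and then call `store` with the new prompt and response, mirroring the `else` branch of the pseudo-code.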

Tuning similarity and safety

Calibration: evaluate the similarity threshold on a held-out set of paraphrases and unrelated prompts to estimate false-positive reuse.
Hybrid checks: for high-risk outputs, combine semantic match with lightweight heuristics (e.g., entity overlap, output-shape checks) or a fast reranker before returning cached content.
Metadata gating: ensure model version, schema version, or prompt-template changes invalidate or block reuse.
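The calibration step can be sketched as a sweep over candidate thresholds on labeled pairs. The similarity scores and labels below are made up purely for illustration; in practice they would come from your own held-out paraphrases and unrelated prompts. The function names are hypothetical.

```python
def false_positive_rate(pairs, threshold):
    """Share of non-equivalent pairs that would be wrongly served from cache."""
    negatives = [sim for sim, equivalent in pairs if not equivalent]
    return sum(sim >= threshold for sim in negatives) / len(negatives)

def hit_rate(pairs, threshold):
    """Share of truly equivalent pairs that would be served from cache."""
    positives = [sim for sim, equivalent in pairs if equivalent]
    return sum(sim >= threshold for sim in positives) / len(positives)

# Made-up (similarity, is_equivalent) labels, for illustration only.
labeled_pairs = [
    (0.97, True), (0.93, True), (0.91, True), (0.86, True), (0.78, True),
    (0.88, False), (0.82, False), (0.74, False), (0.60, False), (0.41, False),
]

for threshold in (0.80, 0.85, 0.90):
    print(threshold, hit_rate(labeled_pairs, threshold),
          false_positive_rate(labeled_pairs, threshold))
```

Raising the threshold lowers the false-positive rate at the cost of fewer cache hits, so the sweep shows the trade-off you are choosing between.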

Advantages: handles paraphrases; higher effective cache hit rate for conversational queries.

Limitations: requires embeddings, vector storage, and careful tuning to avoid incorrect reuse.

Choosing between exact and semantic caching

Use exact-match caching when correctness and determinism matter and prompts are highly templated.
Use semantic caching when queries are natural language, paraphrases are common, and some approximation is acceptable in exchange for higher hit rates.
Hybrid approach: an effective production design usually combines both. Try exact-match first; if it misses, fall back to semantic search. Store both kinds of keys and de-duplicate on insertion.

Metrics, monitoring, and operational concerns

Track these key metrics:

Cache hit rate (exact / semantic)
End-to-end latency for cache hits vs misses
Cost saved (tokens/compute avoided)
False reuse incidents (semantic false positives) and user impact

Operational concerns:

Eviction policy & TTL — balance storage costs and freshness.
Model upgrades — invalidate or tag cache entries produced by older model versions (or bump schema version).
Privacy & sensitivity — avoid caching PII or sensitive outputs unless encrypted and access-controlled.
Auditability — log when responses were served from cache and the matched key/score.

Implementation & Code

Want to see working examples? Check out the implementation with code:

VaibhavAhluwalia / llm-caching-systems

Practical implementations and experiments for building fast, scalable, and cost-efficient Large Language Model (LLM) applications using caching techniques.

The repository includes:

Interactive notebooks demonstrating both caching strategies
Requirements file for easy setup

Conclusion and what's next (Part 2)

Exact-match and semantic caching are foundational. Together they allow LLM systems to be faster and cheaper while retaining the benefits of large models.

In Part 2 of this series, we'll cover other techniques.

What caching strategies have worked best in your LLM projects? Share your experiences in the comments below!

Connect with me:

GitHub: @VaibhavAhluwalia
LinkedIn: (https://www.linkedin.com/in/vaibhav-ahluwalia-83887a227/)
