DEV Community: Kotcherla Murali Krishna

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

Kotcherla Murali Krishna — Tue, 09 Jun 2026 02:26:16 +0000

A deep dive into memory fragmentation, paged memory management, and why PagedAttention can deliver up to 24× higher throughput than conventional KV cache implementations.

Every token you generate during LLM inference silently eats GPU memory. With traditional KV caching, a significant portion of that memory is wasted — never used, never reclaimed. vLLM’s PagedAttention changed that by borrowing a decades-old idea from operating systems. Here’s exactly how it works and why it matters.

What Is a KV Cache and Why Does It Exist?
The Problem: Traditional KV Cache and Memory Fragmentation
Inspiration from OS Virtual Memory — The vLLM Insight
PagedAttention: How It Works
Memory Fragmentation: Before vs After
Throughput Gains: Numbers and Benchmarks
Trade-offs and Limitations
Who Should Care About This?
Key Takeaways

Section 1 — What Is a KV Cache and Why Does It Exist?

What to cover:

During autoregressive decoding, each new token attends to all previous tokens
Without caching: recompute keys/values for all past tokens at every step → the K/V projection work alone grows as O(n²) over n generated tokens
With KV cache: store key/value tensors per layer per token in GPU memory → amortize the attention cost
KV cache size formula: 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes

Code snippet to include:

# Rough KV cache size estimation
num_layers = 32 # LLaMA-2 7B
num_heads = 32
head_dim = 128 # typically d_model / num_heads
seq_len = 2048
batch_size = 8
dtype_bytes = 2 # float16

kv_cache_bytes = (
    2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes
)
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")
# Output: KV cache: 8.59 GB ← just for 8 sequences at 2K context

Key insight to land:

The KV cache is not a nice-to-have — it’s essential for serving LLMs at any reasonable speed. But the way it was traditionally allocated is deeply wasteful.

Section 2 — The Problem: Traditional KV Cache and Memory Fragmentation

What to cover:

Pre-allocation problem:

Traditional frameworks pre-allocate a contiguous block of GPU memory per sequence based on the maximum possible sequence length
A request with max_seq_len=2048 gets 2048 slots — even if it only generates 50 tokens
Wasted memory = (max_seq_len - actual_len) × per_token_kv_size
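
As a rough illustration of the waste formula above, the sketch below plugs in the LLaMA-2 7B shape from Section 1 (32 layers, 32 heads, head_dim 128, float16). The function names and the 50-slot example are illustrative assumptions, not measurements from any framework.

```python
def per_token_kv_bytes(num_layers: int, num_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes


def wasted_bytes(max_seq_len: int, actual_len: int, per_token: int) -> int:
    """Reserved-but-unused KV memory for one sequence under max-length pre-allocation."""
    return (max_seq_len - actual_len) * per_token


per_token = per_token_kv_bytes(num_layers=32, num_heads=32, head_dim=128, dtype_bytes=2)
waste = wasted_bytes(max_seq_len=2048, actual_len=50, per_token=per_token)
waste_fraction = (2048 - 50) / 2048

print(f"Per-token KV: {per_token / 2**20:.2f} MiB")   # Per-token KV: 0.50 MiB
print(f"Wasted: {waste / 1e9:.2f} GB ({waste_fraction:.1%} of the reservation)")
# Wasted: 1.05 GB (97.6% of the reservation)
```

Under these assumptions a single short request strands roughly a gigabyte of reserved memory, which is why the waste compounds so quickly across a batch.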

Three types of fragmentation:

[Image: three types of fragmentation]

Real-world consequence:

GPU shows 80% memory used → only 30–40% is actually holding useful KV data
Batching becomes impossible because you can’t fit more sequences even though memory “exists”
High latency under load → queuing, not computation, becomes your bottleneck

Diagram suggestion:

[Image: traditional KV cache]

Section 3 — Inspiration from OS Virtual Memory: The vLLM Insight

What to cover:

The vLLM team (Kwon et al., 2023) drew a direct parallel to how OS manages RAM via paging
In OS paging: physical RAM is divided into fixed-size “frames”; processes get “pages” mapped to frames via a page table
Physical memory can be non-contiguous; the virtual address space appears contiguous to the process
Key OS insight: don’t allocate memory until you actually need it (demand paging)

The analogy table:

[Image: the analogy table]

Quote worth referencing:

The vLLM paper describes PagedAttention as managing the KV cache “like virtual memory in an OS, with paging.” (Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, 2023)

Section 4 — PagedAttention: How It Works

What to cover:

Block-based KV cache:

GPU memory is divided into fixed-size KV blocks (e.g., 16 tokens per block)
Each block holds key/value tensors for exactly block_size tokens across all layers
A block table maps each sequence to a list of physical block indices — like a page table
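
The page-table analogy can be made concrete with a small address-translation sketch: a logical token position is split into a block index and an offset, and the block table supplies the physical block. The `locate` helper and the example block IDs are hypothetical, chosen only to show the lookup.

```python
BLOCK_SIZE = 16  # tokens per block, matching the example above


def locate(block_table: list[int], token_pos: int, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Translate a logical token position into (physical_block_id, offset_in_block)."""
    logical_block, offset = divmod(token_pos, block_size)
    if logical_block >= len(block_table):
        raise IndexError(f"token {token_pos} is beyond the allocated blocks")
    return block_table[logical_block], offset


# A sequence whose three logical blocks live in scattered physical blocks
block_table = [7, 2, 11]
print(locate(block_table, 0))    # (7, 0)
print(locate(block_table, 17))   # (2, 1)
print(locate(block_table, 40))   # (11, 8)
```

The sequence sees positions 0 to 47 as contiguous, while the physical blocks can sit anywhere in the pool.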

Allocation lifecycle:

Request arrives → zero blocks allocated
First block_size tokens generated → one block allocated
Next block_size tokens → second block allocated (can be physically non-contiguous)
Request completes → blocks freed immediately, returned to pool

Prompt sharing (copy-on-write):

Multiple requests with identical prefixes (e.g., a system prompt) can share the same physical KV blocks
On divergence (beam search branch, parallel sampling), blocks are copied — copy-on-write semantics
Massive memory savings in high-traffic deployments with shared system prompts
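
One way to picture copy-on-write is a block pool with reference counts: forking a sequence bumps the counts, and a write to a shared block triggers a copy first. The `SharedBlockPool` class below is a toy written for this article under that assumption; it is not vLLM's allocator, and it tracks block IDs only, not the KV tensors.

```python
class SharedBlockPool:
    """Toy block pool with reference counts (illustrative only)."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        block_id = self.free.pop()
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_table: list[int]) -> list[int]:
        """Share every block of a parent sequence with a new child sequence."""
        for block_id in block_table:
            self.ref_count[block_id] += 1
        return list(block_table)

    def write_last_block(self, block_table: list[int]) -> bool:
        """Prepare the last block for a write; returns True when a copy was made."""
        last = block_table[-1]
        if self.ref_count[last] == 1:
            return False  # sole owner: write in place
        self.ref_count[last] -= 1
        block_table[-1] = self.allocate()  # a real system would also copy the KV data
        return True


pool = SharedBlockPool(total_blocks=8)
parent = [pool.allocate(), pool.allocate()]  # e.g. blocks holding a shared system prompt
child = pool.fork(parent)

copied = pool.write_last_block(child)  # child diverges, so its last block is copied
print(copied, parent[0] == child[0], parent[1] != child[1])  # True True True
```

Only the block being written is duplicated; the earlier prefix block stays shared between both sequences.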

Code snippet (simplified block table concept):

from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 16 # tokens per block

@dataclass
class KVBlock:
    block_id: int
    token_count: int = 0
    is_shared: bool = False

@dataclass
class SequenceState:
    seq_id: int
    block_table: List[int] = field(default_factory=list) # physical block IDs
    total_tokens: int = 0

class PagedKVManager:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.blocks = {i: KVBlock(block_id=i) for i in range(total_blocks)}

    def allocate_block(self) -> Optional[int]:
        if not self.free_blocks:
            return None # OOM — trigger preemption
        return self.free_blocks.pop()

    def append_token(self, seq: SequenceState) -> bool:
        """Allocate a new block if the current one is full."""
        needs_new_block = (
            not seq.block_table or
            self.blocks[seq.block_table[-1]].token_count == BLOCK_SIZE
        )
        if needs_new_block:
            block_id = self.allocate_block()
            if block_id is None:
                return False # signal OOM
            seq.block_table.append(block_id)

        current_block = self.blocks[seq.block_table[-1]]
        current_block.token_count += 1
        seq.total_tokens += 1
        return True

    def free_sequence(self, seq: SequenceState):
        for block_id in seq.block_table:
            self.blocks[block_id].token_count = 0
            self.free_blocks.append(block_id)
        seq.block_table.clear()

Section 5 — Memory Fragmentation: Before vs After

What to cover:

Traditional system waste breakdown (typical):

Internal fragmentation: ~20–30% (unused reserved slots within sequences)
External fragmentation: ~10–15% (gaps between sequence allocations)
Effective utilization: often 60–70% at best, and the vLLM paper measured only about 20–40% in the systems it profiled

PagedAttention waste profile:

Internal fragmentation: only within the last block of each sequence → max block_size - 1 tokens wasted per sequence
External fragmentation: near zero — all blocks are equal-sized, pool is homogeneous
Effective utilization: typically 90–96%
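
The two waste profiles can be compared with a short calculation. The sketch below assumes hypothetical sequence lengths and models internal waste only (max-length reservation versus rounding up to whole blocks); external fragmentation is not modeled.

```python
import math


def reserved_utilization(seq_lens: list[int], max_seq_len: int) -> float:
    """Fraction of reserved KV slots in use when each sequence reserves max_seq_len slots."""
    return sum(seq_lens) / (len(seq_lens) * max_seq_len)


def paged_utilization(seq_lens: list[int], block_size: int) -> float:
    """Fraction of allocated KV slots in use when slots are handed out in whole blocks."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return sum(seq_lens) / allocated


seq_lens = [50, 130, 700, 1200]  # hypothetical current lengths, in tokens
print(f"reserved: {reserved_utilization(seq_lens, max_seq_len=2048):.1%}")  # reserved: 25.4%
print(f"paged:    {paged_utilization(seq_lens, block_size=16):.1%}")        # paged:    98.5%
```

The exact percentages depend entirely on the length distribution; the point is that paged waste is bounded by one partial block per sequence.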

Fragmentation comparison diagram:

[Image: fragmentation comparison diagram]

Section 6 — Throughput Gains: Numbers and Benchmarks

What to cover:

From the vLLM paper (Kwon et al., 2023):

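vLLM improves throughput by 2–4× at comparable latency over state-of-the-art serving systems such as FasterTransformer and Orca
Orca already uses iteration-level (continuous) batching, so vLLM's margin over it comes from KV cache memory management rather than scheduling
The headline figure of up to 24× comes from the vLLM project's launch benchmarks against HuggingFace Transformers, not from the paper's comparisons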

Why throughput improves:

Higher GPU memory utilization → more sequences in a batch simultaneously
Less queuing → GPU stays busy rather than stalling waiting for memory
Prompt sharing → system-prompt KV computed once, shared across N requests
Cheaper preemption → when blocks run out, a preempted sequence's blocks can be swapped to CPU memory or freed and recomputed later

Throughput table (representative, based on vLLM paper figures):

[Image: throughput table]

Note: Actual numbers vary by model size, GPU type, sequence length distribution, and request arrival pattern. Always benchmark on your workload.

Code snippet — measuring throughput with vLLM:

from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    gpu_memory_utilization=0.90, # fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,
)

prompts = ["Explain quantum entanglement in simple terms."] * 100
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter() # monotonic clock, better suited to timing than time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {len(prompts)/elapsed:.2f} req/s")
print(f"Token throughput: {total_tokens/elapsed:.0f} tok/s")

Section 7 — Trade-offs and Limitations

What to cover (balanced, Medium readers appreciate honesty):

What you gain:

Dramatically higher memory utilization
Much better batch concurrency → lower latency at scale
Prefix caching out of the box in modern vLLM versions

What you trade or accept:

Block table overhead : small CPU-side memory + lookup cost per attention step
Block size tuning : too small → excessive block table management; too large → internal fragmentation creeps back
Attention kernel complexity : kernels must read K/V through the block table; vLLM shipped custom CUDA kernels for this, and newer FlashAttention releases have added paged KV support
Not a magic fix for compute : PagedAttention solves the memory bottleneck — if you’re compute-bound (long context, large model), gains are smaller
Prefill is unaffected : PagedAttention mainly helps the decode phase; prefill still has to process every prompt token (in parallel, and compute-bound) however the cache is laid out

Block size sensitivity:

# Approximate internal fragmentation vs block size
def max_waste_fraction(block_size, avg_seq_len):
    # Worst case: last block is almost empty
    max_wasted_per_seq = block_size - 1
    return max_wasted_per_seq / avg_seq_len

for bs in [8, 16, 32, 64]:
    waste = max_waste_fraction(bs, avg_seq_len=256)
    print(f"block_size={bs:3d}: max waste = {waste*100:.1f}%")

# block_size=  8: max waste = 2.7%
# block_size= 16: max waste = 5.9%
# block_size= 32: max waste = 12.1%
# block_size= 64: max waste = 24.6%

vLLM defaults to block_size=16 as a practical sweet spot.

Section 8 — Who Should Care About This?

Target reader callouts:

If you’re serving LLMs in production:

This is the single most impactful infrastructure change for throughput under concurrent load
Switch to vLLM or an equivalent (SGLang, TensorRT-LLM with paged cache support)

If you’re building on top of HuggingFace Transformers:

Great for research/single-user; not designed for high-concurrency serving
The KV cache in model.generate() uses contiguous pre-allocation — fine for one request, fragmentation-heavy at scale

If you’re doing fine-tuning or evaluation:

PagedAttention is a serving optimization; during training, you typically don’t cache KV across steps the same way
Still useful to understand for when you deploy your fine-tuned model

If you’re reading vLLM/SGLang source code:

BlockSpaceManager, BlockAllocator, and Scheduler are the core classes to study
The block table is maintained per-sequence in the scheduler, not inside the model forward pass

Section 9 — Key Takeaways

Traditional KV cache pre-allocates contiguous memory per sequence based on max length → 30–40% typical waste
Memory fragmentation — internal, external, and reservation — is the root cause of low GPU utilization and poor batching in naive LLM serving
PagedAttention borrows OS virtual memory paging : fixed-size blocks, on-demand allocation, block tables instead of contiguous buffers
Result: 90–96% memory utilization , enabling more sequences per batch and dramatically higher throughput
vLLM demonstrated 2–24× throughput gains depending on workload and baseline compared
Key trade-offs : custom CUDA kernels required, block size tuning matters, compute-bound workloads see smaller benefits
Prompt sharing (copy-on-write) is an underrated bonus: shared prefixes are computed once, referenced many times

Memory Systems for AI Agents: The Complete Developer Guide

Kotcherla Murali Krishna — Sun, 24 May 2026 16:41:36 +0000

How modern AI agents remember, reason, and learn — and how to build them right

Introduction

Every time you close a chat window, the AI forgets you exist.

This is not a bug — it is architecture. Large language models are, by default, stateless. They receive a prompt, generate a response, and discard everything. But the next generation of AI agents needs to do far more: track long-running tasks, learn from past interactions, coordinate across sessions, and reason over accumulated knowledge.

Memory is the missing layer that transforms a chatbot into an agent.

This guide breaks down every major memory system used in AI agents today — what they are, how they work, when to use each, and how to combine them into production-ready architectures.

Why Memory Matters

Consider the difference between these two interactions:

Without memory:

User: “What did we decide about the database schema last Tuesday?” Agent: “I don’t have access to previous conversations.”

With memory:

User: “What did we decide about the database schema last Tuesday?” Agent: “You decided to normalize the user table into three relations and defer the indexing strategy to after the first load test.”

The gap is not intelligence — it is persistence. Memory gives agents:

Continuity across sessions and workflows
Personalization based on accumulated user context
Efficiency by avoiding repeated reasoning over the same facts
Autonomy to pursue multi-step goals without human re-prompting at every step

The Four Types of Agent Memory

AI agent memory maps loosely onto cognitive science. Researchers typically distinguish four types, each serving a different purpose.

1. Sensory / Working Memory (In-Context)

What it is: Everything currently inside the model’s context window — the “working desk” of the agent.

How it works: The transformer attention mechanism operates over all tokens in the context window simultaneously. This is the only memory that directly influences model outputs.

Characteristics:

Fast — zero retrieval latency
Limited — bounded by context window size (4K to 2M tokens depending on model)
Volatile — completely lost when the session ends
Ordered — the model can reason over temporal sequences within context

When to use it:

Current task state, tool outputs, user messages in the active session
Recently retrieved facts that need active reasoning
Intermediate reasoning steps (chain-of-thought scratchpads)

Implementation:

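import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# The Messages API takes the system prompt as a top-level argument, not as a message role
messages = [
    {"role": "user", "content": "Step 1 result: ..."},
    {"role": "assistant", "content": "Proceeding to step 2..."},
    {"role": "user", "content": "Step 2 result: ..."},
]
response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=1024,
    system=system_prompt, messages=messages,
)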

Key insight: Context window management is itself a memory problem. When context fills up, agents must decide what to summarize, compress, or evict — turning working memory into a policy decision.
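
The simplest such policy is to evict the oldest messages until the history fits a token budget. The sketch below assumes a crude character-based token estimate and hypothetical helper names (`estimate_tokens`, `trim_to_budget`); a real agent would use the model's tokenizer and might summarize instead of dropping.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined estimate fits within max_tokens."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 200},  # ~50 tokens
    {"role": "user", "content": "c" * 100},       # ~25 tokens
]
trimmed = trim_to_budget(history, max_tokens=80)
print([m["content"][0] for m in trimmed])  # ['b', 'c']
```

Note that plain eviction can leave the history starting with an assistant turn, so a production policy also has to keep the role order valid for the target API.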

2. Episodic Memory (Conversation & Event History)

What it is: Records of specific past events — conversations, actions taken, outcomes observed.

How it works: Episodes are stored externally (database, vector store, log files) and retrieved selectively into context when relevant.

Characteristics:

Persistent across sessions
Structured around time and causality (“what happened, when, and what followed”)
Indexed for retrieval by recency, relevance, or both
Can store raw transcripts or summarized episode representations

When to use it:

User conversation history (“You mentioned last week that…”)
Agent action logs for debugging and auditing
Workflow checkpointing for long-running tasks
Learning from past successes and failures

Implementation pattern:

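from datetime import datetime, timedelta, timezone

# Assumes `db` is a pymongo Database handle and `current_user` holds the user's ID
# Store episode
episode = {
    "user_id": current_user,  # required so the retrieval filter below can match
    "session_id": "abc123",
    "timestamp": datetime(2025, 5, 22, 14, 30, tzinfo=timezone.utc),
    "user_input": "Analyze the Q2 sales data",
    "agent_actions": ["read_csv", "compute_summary", "generate_chart"],
    "outcome": "success",
    "summary": "User requested Q2 analysis. Identified 18% YoY growth in APAC region."
}
db.episodes.insert_one(episode)

# Retrieve relevant episodes
thirty_days_ago = datetime.now(timezone.utc) - timedelta(days=30)
past_episodes = db.episodes.find({
    "user_id": current_user,
    "timestamp": {"$gte": thirty_days_ago}
}).sort("timestamp", -1).limit(5)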

Design consideration: Store both raw episodes and distilled summaries. Raw episodes support audit trails and replay; summaries support fast context injection.

3. Semantic Memory (Knowledge & Facts)

What it is: General knowledge, facts, concepts, and domain expertise — decoupled from specific events.

How it works: Information is embedded into a vector space and stored in a vector database. At query time, semantically similar content is retrieved and injected into context (Retrieval-Augmented Generation, or RAG).

Characteristics:

Persistent and shareable across users and sessions
Retrieved by semantic similarity, not exact match
Scales to millions of documents
Requires embedding model + vector store infrastructure

When to use it:

Company knowledge bases, documentation, FAQs
Domain-specific corpora (legal, medical, financial)
Product catalogs, policy documents
Any knowledge too large to fit in context

Implementation with semantic search:

from sentence_transformers import SentenceTransformer
import chromadb

# Index documents
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("knowledge_base")

documents = load_documents("./docs/")  # placeholder for your own loader; items need .id and .text
embeddings = model.encode([doc.text for doc in documents])
collection.add(
    embeddings=embeddings.tolist(),
    documents=[doc.text for doc in documents],
    ids=[doc.id for doc in documents]
)

# Retrieve at query time
query_embedding = model.encode([user_query])
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=5)

# Inject into context
context = "\n\n".join(results["documents"][0])
augmented_prompt = f"Use the following context:\n{context}\n\nUser query: {user_query}"

Advanced pattern — Hybrid Retrieval:

Pure vector search misses exact keyword matches. Combine BM25 (keyword) with dense retrieval (semantic) for higher recall:

from rank_bm25 import BM25Okapi
from sklearn.preprocessing import normalize
import numpy as np

def hybrid_retrieve(query, documents, embeddings, alpha=0.5, top_k=5):
    # BM25 scores
    tokenized = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.split())

    # Dense scores
    query_emb = model.encode([query])
    dense_scores = np.dot(normalize(embeddings), normalize(query_emb).T).flatten()

    # Combine
    combined = alpha * normalize([bm25_scores])[0] + (1 - alpha) * dense_scores
    top_indices = np.argsort(combined)[::-1][:top_k]
    return [documents[i] for i in top_indices]

4. Procedural Memory (Skills & Workflows)

What it is: Encoded knowledge of how to do things — reusable procedures, tool-use patterns, and learned behavioral strategies.

How it works: Can be represented as prompts (few-shot examples), code (tool implementations), structured workflows (graphs, state machines), or fine-tuned model weights.

Characteristics:

Represents capability, not facts
Typically stable — updated less frequently than episodic or semantic memory
Can be invoked on demand (“use the data_analysis skill”)
Encodes both successful and corrected failure patterns

When to use it:

Standardized multi-step workflows (data pipelines, report generation)
Tool usage patterns and API call sequences
Domain-specific reasoning strategies
Safety and compliance guardrails as behavioral constraints

Implementation — Prompt-based procedural memory:

PROCEDURES = {
    "data_analysis": """
        When analyzing data:
        1. First, describe the dataset schema and shape.
        2. Check for missing values and outliers.
        3. Compute descriptive statistics.
        4. Identify trends and correlations.
        5. Summarize key findings in plain language.
        Always cite the row count and column names in your response.
    """,
    "code_review": """
        When reviewing code:
        1. Check for correctness first (does it do what it claims?).
        2. Identify security vulnerabilities (injection, auth, secrets).
        3. Assess performance implications.
        4. Comment on readability and maintainability.
        5. Suggest concrete improvements with code examples.
    """
}

def inject_procedure(task_type: str, base_prompt: str) -> str:
    procedure = PROCEDURES.get(task_type, "")
    return f"{procedure}\n\n{base_prompt}" if procedure else base_prompt

Memory Storage Backends

Choosing the right storage layer is as important as choosing the right memory type.

Practical recommendation: Start with PostgreSQL + pgvector. It handles relational structure (episodes, user data) and vector search in one system, avoiding the operational overhead of a separate vector database until scale demands it.

Memory Architecture Patterns

Pattern 1: The Memory Stack

The simplest production pattern layers all four memory types, with each feeding into context at retrieval time.

Memory write-back is critical and often overlooked. After each interaction, the agent should update:

Episodic store with a summary of what happened
Semantic store if new facts were established
Procedural store if a new workflow pattern emerged

Pattern 2: Memory-Augmented ReAct

ReAct (Reason + Act) agents interleave reasoning steps with tool calls. Memory becomes a first-class tool:

Thought: I need to check if we’ve handled this type of request before.

Action: memory_search(query="database migration rollback", type="episodic")

Observation: Found 3 similar episodes. In 2/3 cases, the solution was…

Thought: Based on past experience, I should first verify the backup exists.

Action: check_backup(database="prod_db")

…

This makes memory transparent, auditable, and controllable.
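
A minimal sketch of memory as a tool, assuming an in-memory list in place of the episodic store and simple word overlap in place of real retrieval. The names `EPISODES`, `TOOLS`, and `run_action` are hypothetical, and the parameter is called `memory_type` here (rather than `type` as in the trace) to avoid shadowing the Python builtin.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the memory stores
EPISODES = [
    {"type": "episodic", "text": "database migration rollback succeeded after restoring backup"},
    {"type": "episodic", "text": "quarterly sales report generated for APAC"},
    {"type": "semantic", "text": "rollback requires a verified backup"},
]


def memory_search(query: str, memory_type: str, top_k: int = 3) -> list[str]:
    """Rank stored memories of one type by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for item in EPISODES:
        if item["type"] != memory_type:
            continue
        overlap = len(query_words & set(item["text"].lower().split()))
        if overlap:
            scored.append((overlap, item["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Memory sits in the same registry as any other tool the agent can call
TOOLS: dict[str, Callable[..., object]] = {"memory_search": memory_search}


def run_action(name: str, **kwargs) -> object:
    """Dispatch one ReAct action and return its observation."""
    return TOOLS[name](**kwargs)


observation = run_action(
    "memory_search", query="database migration rollback", memory_type="episodic"
)
print(observation)  # ['database migration rollback succeeded after restoring backup']
```

Because every lookup goes through `run_action`, each memory access shows up in the agent trace like any other tool call.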

Pattern 3: Hierarchical Summarization

For very long-running agents (days to weeks), raw episode storage becomes unmanageable. Use hierarchical summarization:

Raw episodes (last 24h) → Daily summary

Daily summaries (last 7d) → Weekly summary

Weekly summaries → Persistent user profile

This mirrors how humans consolidate memories during sleep — detail fades, patterns persist.

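from datetime import datetime, timedelta

# `db` and `llm` are assumed async clients for the episode store and the model
async def consolidate_memory(user_id: str):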
    # Fetch yesterday's raw episodes
    yesterday = datetime.now() - timedelta(days=1)
    episodes = await db.get_episodes(user_id, since=yesterday)

    # Summarize via LLM
    raw_text = "\n".join([e["summary"] for e in episodes])
    daily_summary = await llm.summarize(
        f"Summarize these agent interactions into key facts and outcomes:\n{raw_text}"
    )

    # Store consolidated summary
    await db.store_daily_summary(user_id, daily_summary, date=yesterday)

    # Optionally archive raw episodes to cold storage
    await db.archive_episodes(user_id, before=yesterday)

Memory Retrieval Strategies

Recency-Weighted Retrieval

Recent information is usually more relevant. Apply time decay to retrieval scores:

import math
from datetime import datetime

def time_decay_score(base_score: float, created_at: datetime, half_life_days: float = 7) -> float:
    # Use fractional days; timedelta.days alone would truncate to whole days
    days_elapsed = (datetime.now() - created_at).total_seconds() / 86400
    decay = math.exp(-math.log(2) * days_elapsed / half_life_days)  # halves every half_life_days
    return base_score * decay

Importance-Based Retention

Not all memories are equal. Score importance at write time:

IMPORTANCE_SIGNALS = {
    "user_correction": 1.0, # "No, that's wrong — it should be..."
    "explicit_preference": 0.9, # "I always prefer..."
    "task_success": 0.7, # Completed goals
    "factual_statement": 0.5, # Stated facts
    "casual_mention": 0.2, # Passing references
}

def score_importance(episode: dict) -> float:
    signal = episode.get("signal_type", "casual_mention")
    return IMPORTANCE_SIGNALS.get(signal, 0.3)

Contextual Compression

Before injecting retrieved memories into context, compress them to fit within token budgets:

async def compress_memories(memories: list[str], max_tokens: int = 800) -> str:
    combined = "\n".join(memories)
    if count_tokens(combined) <= max_tokens:
        return combined

    compressed = await llm.complete(
        f"Compress the following memory entries to under {max_tokens} tokens, "
        f"preserving only the most relevant facts:\n\n{combined}"
    )
    return compressed

Common Pitfalls and How to Avoid Them

Pitfall 1: Memory Hallucination Propagation

If the agent writes a hallucinated fact to memory, it reinforces itself across future sessions. The agent becomes increasingly confident in a falsehood.

Fix: Apply confidence thresholds at write time. Only persist memories that pass a factual grounding check, or flag them with uncertainty metadata.

async def write_memory_safe(fact: str, source: str) -> bool:
    grounding_check = await llm.complete(
        f"Is the following statement grounded in verifiable evidence? "
        f"Reply only YES or NO.\n\nStatement: {fact}\nSource: {source}"
    )
    if "YES" in grounding_check.upper():
        await memory_store.write(fact, source=source, confidence="high")
        return True
    else:
        await memory_store.write(fact, source=source, confidence="uncertain")
        return False

Pitfall 2: Context Flooding

Retrieving too many memories degrades performance. LLM accuracy tends to drop as context fills with marginally relevant content, and the “lost in the middle” results show that models use information placed mid-context less reliably than information near the start or end.

Fix: Enforce hard limits on retrieved memory tokens (e.g., max 20% of context window), and rank retrieval results before injection.
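
A hedged sketch of that fix: rank the retrieved memories by score, then admit them greedily until a fixed share of the context window is spent. The `select_memories` helper, the 20% default, and the character-based token estimate are assumptions made for illustration.

```python
def select_memories(
    scored_memories: list[tuple[float, str]],
    context_window: int,
    max_fraction: float = 0.2,
) -> list[str]:
    """Pick the highest-scoring memories that fit within a fixed share of the context window."""
    budget = int(context_window * max_fraction)
    selected: list[str] = []
    used = 0
    for _, text in sorted(scored_memories, key=lambda pair: pair[0], reverse=True):
        cost = max(1, len(text) // 4)  # crude estimate, about 4 characters per token
        if used + cost > budget:
            continue  # skip entries that do not fit; smaller ones may still fit
        selected.append(text)
        used += cost
    return selected


memories = [(0.9, "x" * 400), (0.8, "y" * 800), (0.4, "z" * 200)]
print([m[0] for m in select_memories(memories, context_window=1000)])  # ['x', 'z']
```

Skipping oversized entries instead of stopping at the first miss is a design choice; stopping early keeps strict rank order at the cost of unused budget.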

Pitfall 3: No Memory Eviction Policy

Without eviction, stores grow unbounded. Old, irrelevant memories add noise and increase costs.

Fix: Implement TTL (time-to-live) for episodic memories and importance-based pruning for semantic stores.

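-- PostgreSQL has no built-in row TTL: store an expiry and delete expired rows on a schedule
CREATE INDEX idx_expires_at ON episodes(expires_at);

-- Set TTL on insert
INSERT INTO episodes (content, importance, expires_at)
VALUES ($1, $2, NOW() + INTERVAL '30 days' * $2); -- importance scales retention

-- Run periodically (e.g., from a cron job) to evict expired episodes
DELETE FROM episodes WHERE expires_at < NOW();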

Pitfall 4: Memory Without Privacy Controls

In multi-user systems, memory isolation failures can leak one user’s data to another.

Fix: Namespace all memory keys by user ID. Apply row-level security in the database. Audit access patterns.

# Always scope queries to the requesting user
async def retrieve_memory(query: str, user_id: str) -> list:
    results = await vector_store.query(
        embedding=embed(query),
        filter={"user_id": user_id}, # Hard filter, not just ranking
        top_k=5
    )
    return results

Putting It All Together: A Reference Architecture

class AgentMemorySystem:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.episodic_db = PostgreSQLStore()
        self.semantic_db = VectorStore(collection=f"user_{user_id}")
        self.procedures = ProcedureRegistry()
        self.working_memory = [] # In-context state

    async def retrieve_context(self, query: str, max_tokens: int = 2000) -> str:
        """Retrieve and assemble relevant memory for the current query."""

        # 1. Fetch recent episodes (recency bias)
        episodes = await self.episodic_db.recent(self.user_id, n=5)

        # 2. Semantic search for relevant knowledge
        knowledge = await self.semantic_db.search(query, top_k=5)

        # 3. Select relevant procedure
        procedure = self.procedures.select(query)

        # 4. Assemble and compress to fit token budget
        context_parts = [
            f"Relevant past interactions:\n{self._format(episodes)}",
            f"Relevant knowledge:\n{self._format(knowledge)}",
            f"Behavioral guidelines:\n{procedure}",
        ]

        return await compress_to_budget("\n\n".join(context_parts), max_tokens)

    async def write_back(self, interaction: dict):
        """Persist memory after each interaction."""
        summary = await self._summarize(interaction)
        importance = score_importance(interaction)

        await self.episodic_db.insert({
            "user_id": self.user_id,
            "summary": summary,
            "importance": importance,
            "expires_at": compute_expiry(importance)
        })

        # Extract any new standalone facts for semantic store
        facts = await self._extract_facts(interaction)
        for fact in facts:
            await self.semantic_db.upsert(fact, metadata={"user_id": self.user_id})
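
The reference class calls a `compute_expiry` helper that is not shown. One possible version, assuming the same rule as the SQL example in Pitfall 3 (a 30-day base window scaled by importance), could look like this; the `base_days` and `now` parameters are additions for illustration and testing.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expiry(importance: float, base_days: float = 30.0,
                   now: Optional[datetime] = None) -> datetime:
    """Return an expiry timestamp whose retention window scales with importance."""
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must be between 0 and 1")
    now = now or datetime.now(timezone.utc)
    return now + timedelta(days=base_days * importance)


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(compute_expiry(1.0, now=start))  # 2025-01-31 00:00:00+00:00
print(compute_expiry(0.2, now=start))  # 2025-01-07 00:00:00+00:00
```

With the importance signals defined earlier, a user correction (1.0) would be kept for 30 days and a casual mention (0.2) for 6.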

Key Takeaways

Memory is not one thing. Working, episodic, semantic, and procedural memory serve different roles and require different implementations.
Write-back is as important as retrieval. Memory systems that only read from the past but never learn from the present are incomplete.
Start simple, scale deliberately. In-context state → external episodic store → vector RAG → consolidated profiles. Add layers as complexity demands.
Design for eviction from day one. Unlimited memory growth is a cost and accuracy problem, not just a storage problem.
Privacy is an architecture decision. Namespace, isolate, and audit memory access from the beginning — retrofitting privacy controls is expensive.
Transparency beats opacity. Make memory retrieval visible to users. The ability to inspect, correct, and delete agent memories builds trust and improves system accuracy.

Tags: Artificial Intelligence · Machine Learning · LLM · AI Agents · Software Engineering

Building Micro Agents as Production-Grade Microservices

Kotcherla Murali Krishna — Sun, 24 May 2026 04:24:16 +0000

Build production-grade AI agent systems using microservices. Covers FastAPI, gRPC, Kafka, Kubernetes, OpenTelemetry, and fault-tolerant orchestration patterns in Python.

Introduction & Motivation
Core Architecture Principles
Agent Service Design
The AgentRunner Loop
Inter-Agent Communication
Tool Registry Service
Memory Architecture
Context Window Management
Orchestrator & Supervisor Pattern
Security & Authorization
Observability: Traces, Logs, Metrics
Deployment on Kubernetes
Scaling Strategies
Fault Tolerance & Retry Strategies
Testing Agent Microservices
CI/CD Pipeline for Agent Services
Cost Management & Token Budgeting
Production Readiness Checklist
Reference Architecture Diagram

Introduction & Motivation

Why monolithic agent systems fail in production

A single-process agent that handles reasoning, tool calls, memory retrieval, and output generation works well in prototypes. In production it breaks in predictable ways:

Latency coupling — one slow tool call blocks the entire inference loop
Unscalable compute — you cannot scale the summarization workload independently from the search workload
Blast radius — a single LLM API timeout or memory corruption takes the whole system down
Zero deployment granularity — updating one tool integration requires redeploying everything
No isolation for billing — impossible to attribute compute cost to individual agent functions

The microservice solution

Each autonomous capability becomes an independently deployable, independently scalable service with:

Its own API surface (HTTP/gRPC)
Its own health checks and readiness probes
Its own memory scope (no shared in-process state)
Its own tool bindings (resolved at runtime from a Tool Registry)
Its own observability (distributed traces, metrics, structured logs)

What is a Micro Agent?

A micro agent is a bounded autonomous service that:

Accepts a task (prompt + context + session ID) via an API call
Runs a plan → act → observe loop using an LLM backend
Invokes tools via a centralized Tool Registry
Stores and retrieves conversation state from an external memory store
Returns a typed result or emits an event to downstream consumers

Key insight: A micro agent is not a “smart function” — it is a service with its own API contract, memory scope, failure modes, and SLA. Design it accordingly.

Core Architecture Principles

Single Responsibility

Each agent owns exactly one reasoning domain. Examples:

Stateless Reasoning, Stateful Memory

The LLM inference step must be stateless. Memory lives in external stores:

No conversation history should ever live in in-process RAM between requests.

Schema-First Tool Contracts

Every tool must have a JSON Schema definition published to a shared Tool Registry before any agent can invoke it. No ad-hoc function signatures. This enables:

Runtime input validation before LLM output reaches backend services
Auto-generated documentation
Tool versioning with backwards compatibility checks
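One way to picture the backwards-compatibility check from the last bullet is the sketch below. It applies a deliberately narrow rule set to flat object schemas (no removed parameters, no changed types, no new required fields). `breaking_changes` is a hypothetical helper written for this example, not part of any registry library.

```python
from typing import Any, Dict, List


def breaking_changes(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """List reasons why `new` would break callers written against `old`.

    Only flat object schemas are handled; nested schemas are out of scope.
    """
    problems: List[str] = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    # Existing callers may still send a parameter that the new version dropped.
    for name in old_props:
        if name not in new_props:
            problems.append(f"removed parameter: {name}")
        elif old_props[name].get("type") != new_props[name].get("type"):
            problems.append(f"changed type of: {name}")

    # Existing callers will not send fields that only the new version requires.
    for name in new.get("required", []):
        if name not in old.get("required", []):
            problems.append(f"new required parameter: {name}")

    return problems


v2_0 = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}
v2_1 = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "num_results": {"type": "integer"},  # optional addition: compatible
    },
    "required": ["query"],
}
v3_0 = {
    "type": "object",
    "properties": {"q": {"type": "string"}},  # renamed parameter: breaking
    "required": ["q"],
}

print(breaking_changes(v2_0, v2_1))  # []
print(breaking_changes(v2_0, v3_0))
```

A registry could run a check like this at registration time and reject a minor-version bump that reports any problems.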

Idempotent Actions

Any tool call that modifies external state (send email, write to DB, trigger webhook) must be idempotent. Strategies:

Use idempotency keys at the HTTP layer (pass Idempotency-Key header)
Use message deduplication at the queue level (Kafka exactly-once semantics)
Design tool handlers to be safe to retry: check-then-act patterns
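As a rough illustration of the idempotency-key strategy, the sketch below replays the stored result when the same key arrives twice. `IdempotentHandler` and `send_email` are hypothetical names, and the in-memory dict stands in for a shared store; in a multi-replica deployment the check and the write would need to be atomic (for example a Redis `SET NX`).

```python
from typing import Any, Callable, Dict, List, Tuple


class IdempotentHandler:
    """Wraps a side-effecting function so a repeated key replays the first result."""

    def __init__(self, action: Callable[..., Any]):
        self._action = action
        self._results: Dict[str, Any] = {}  # assumption: a shared store in production

    def call(self, idempotency_key: str, **params: Any) -> Any:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: no second side effect
        result = self._action(**params)
        self._results[idempotency_key] = result
        return result


sent: List[Tuple[str, str]] = []


def send_email(to: str, subject: str) -> dict:
    """Stand-in for a real email tool; records each actual send."""
    sent.append((to, subject))
    return {"message_id": len(sent)}


handler = IdempotentHandler(send_email)
first = handler.call("task-42:step-3", to="a@example.com", subject="Report")
retry = handler.call("task-42:step-3", to="a@example.com", subject="Report")
print(first == retry, len(sent))  # True 1
```

Deriving the key from the task ID and step number means a retried step reuses the key automatically.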

Async by Default

Long-running agent tasks (multi-step research, code generation + execution) must use async task queues — not synchronous HTTP with long timeouts.

Client ──► POST /tasks ──► Kafka/BullMQ ──► AgentWorker
Client ──► GET /tasks/{id} ──► Redis (status polling)
       ◄── WebSocket/SSE push (optional)
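The submit-then-poll flow can be sketched in a single process, with an `asyncio.Queue` standing in for Kafka/BullMQ and a dict standing in for the Redis status store. All names here (`submit`, `worker`, `status`) are illustrative assumptions, and the agent run is replaced by a short sleep.

```python
import asyncio
import uuid
from typing import Dict

status: Dict[str, dict] = {}  # stands in for the Redis status store


async def submit(queue: asyncio.Queue, prompt: str) -> str:
    """POST /tasks: enqueue the work and return a task ID immediately."""
    task_id = str(uuid.uuid4())
    status[task_id] = {"status": "pending", "output": None}
    await queue.put((task_id, prompt))
    return task_id


async def worker(queue: asyncio.Queue) -> None:
    """AgentWorker: pull tasks off the queue and record the outcome."""
    while True:
        task_id, prompt = await queue.get()
        status[task_id]["status"] = "running"
        await asyncio.sleep(0.01)  # placeholder for the real agent run
        status[task_id] = {"status": "completed", "output": prompt.upper()}
        queue.task_done()


async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(worker(queue))
    task_id = await submit(queue, "find ai news")
    submitted = dict(status[task_id])  # what GET /tasks/{id} would return right away
    await queue.join()                 # a real client would poll or wait for a push
    consumer.cancel()
    return {"submitted": submitted, "final": status[task_id]}


snapshots = asyncio.run(main())
print(snapshots["submitted"]["status"], "->", snapshots["final"]["status"])
```

The client never holds a connection open for the duration of the agent run; it only ever sees a task ID and a status record.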

Explicit Context Boundaries

Each agent invocation carries a bounded context packet — never grow unbounded message histories. A ContextManager service compresses/summarizes history before injection.
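A minimal sketch of a bounded context packet, assuming a crude four-characters-per-token estimate and a placeholder where a real ContextManager would call a summarizer. `build_context_packet` and `rough_tokens` are hypothetical helpers, and the summary stub itself is not counted against the budget here.

```python
from typing import Dict, List

Message = Dict[str, str]


def rough_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def build_context_packet(history: List[Message], max_tokens: int) -> List[Message]:
    """Keep the newest messages that fit; fold the rest into one summary stub."""
    kept: List[Message] = []
    used = 0
    for msg in reversed(history):  # walk newest to oldest
        cost = rough_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order

    dropped = len(history) - len(kept)
    if dropped:
        # A real ContextManager would call an LLM summarizer here.
        stub = {"role": "system", "content": f"[{dropped} earlier messages summarized]"}
        return [stub] + kept
    return kept


history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
packet = build_context_packet(history, max_tokens=300)
print(len(packet), packet[0]["content"])  # 4 [7 earlier messages summarized]
```

However long the session grows, the packet stays at or under the budget plus one stub message.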

Agent Service Design

Project Layout

Each agent is a containerized FastAPI or gRPC service with this canonical structure:

agent-search/
├── agent/
│   ├── core.py          # AgentRunner: plan → act → observe loop
│   ├── prompts.py       # System prompt + few-shot templates
│   ├── memory.py        # ContextManager: load/compress/save
│   ├── tools.py         # Tool bindings (calls Tool Registry)
│   └── schemas.py       # Pydantic models for all I/O
├── api/
│   ├── routes.py        # POST /run, GET /status/{task_id}
│   ├── middleware.py    # Auth, rate limiting, request tracing
│   └── deps.py          # Dependency injection: DB, Redis, LLM client
├── tests/
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── Dockerfile
├── pyproject.toml
└── k8s/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    └── configmap.yaml

API Contract

Every agent exposes these HTTP endpoints at minimum:

POST /run              Submit a task (sync, short tasks only)
POST /tasks            Submit a task (async, returns task_id)
GET  /tasks/{task_id}  Poll task status and result
GET  /health           Liveness probe
GET  /ready            Readiness probe (checks LLM + memory store)
GET  /metrics          Prometheus metrics endpoint

# agent/schemas.py
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any
from enum import Enum

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

class AgentTask(BaseModel):
    id: str
    session_id: str
    prompt: str
    metadata: Dict[str, Any] = Field(default_factory=dict)
    max_steps: int = Field(default=10, ge=1, le=25)
    token_budget: int = Field(default=8192, ge=512, le=32768)

class AgentResult(BaseModel):
    task_id: str
    status: TaskStatus
    output: Optional[str] = None
    steps_used: int = 0
    tokens_used: int = 0
    tool_calls: int = 0
    error: Optional[str] = None
    duration_ms: int = 0

The AgentRunner Loop

Full Implementation

# agent/core.py
import asyncio
import time
from opentelemetry import trace
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

tracer = trace.get_tracer(__name__)
MAX_STEPS = 15

class AgentRunner:
    def __init__(self, agent_id: str, config: AgentConfig):
        self.agent_id = agent_id
        self.llm = LLMClient(model=config.model, timeout=30)
        self.memory = ContextManager(agent_id, max_tokens=config.context_limit)
        self.tools = ToolRegistryClient(config.tool_registry_url)
        self.metrics = AgentMetrics(agent_id)

    async def run(self, task: AgentTask) -> AgentResult:
        start = time.monotonic()

        with tracer.start_as_current_span("agent.run") as span:
            span.set_attribute("agent.id", self.agent_id)
            span.set_attribute("agent.task_id", task.id)
            span.set_attribute("agent.session_id", task.session_id)

            try:
                result = await self._run_loop(task, span)
            except TokenBudgetExceeded as e:
                result = AgentResult(
                    task_id=task.id,
                    status=TaskStatus.COMPLETED,
                    output=e.partial_output,
                    error="token_budget_exceeded"
                )
            except Exception as e:
                span.record_exception(e)
                result = AgentResult(
                    task_id=task.id,
                    status=TaskStatus.FAILED,
                    error=str(e)
                )
            finally:
                result.duration_ms = int((time.monotonic() - start) * 1000)
                self.metrics.record(result)

            return result

    async def _run_loop(self, task: AgentTask, span) -> AgentResult:
        # Load available tools from registry
        tool_schemas = await self.tools.fetch(agent_id=self.agent_id)

        # Load and compress conversation history
        context = await self.memory.load(task.session_id)
        messages = build_messages(context, task.prompt)

        total_tokens = 0
        tool_call_count = 0

        for step in range(task.max_steps):
            span.set_attribute("agent.step", step)

            with tracer.start_as_current_span("agent.llm_call") as llm_span:
                response = await self._complete_with_retry(messages, tool_schemas)
                llm_span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
                llm_span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)

            total_tokens += response.usage.total_tokens

            if total_tokens > task.token_budget:
                raise TokenBudgetExceeded(
                    partial_output=response.content,
                    tokens_used=total_tokens
                )

            if response.finish_reason == "stop":
                await self.memory.save(task.session_id, messages + [response.message])
                return AgentResult(
                    task_id=task.id,
                    status=TaskStatus.COMPLETED,
                    output=response.content,
                    steps_used=step + 1,
                    tokens_used=total_tokens,
                    tool_calls=tool_call_count
                )

            if response.tool_calls:
                tool_call_count += len(response.tool_calls)
                results = await self._execute_tools(response.tool_calls)
                messages.append(response.message)
                messages.extend(tool_result_messages(results))

        # Hit max steps — return best available output
        return AgentResult(
            task_id=task.id,
            status=TaskStatus.COMPLETED,
            output=response.content,
            steps_used=task.max_steps,
            tokens_used=total_tokens,
            error="max_steps_reached"
        )

    @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(max=15))
    async def _complete_with_retry(self, messages, tools):
        return await self.llm.complete(messages=messages, tools=tools)

    async def _execute_tools(self, tool_calls):
        tasks = [self.tools.invoke(tc) for tc in tool_calls]
        return await asyncio.gather(*tasks, return_exceptions=True)

Inter-Agent Communication

Pattern Selection Matrix

gRPC Service Definition

For synchronous sub-agent calls, gRPC provides strong typing, bidirectional streaming, and efficient binary serialization.

// proto/agent_service.proto
syntax = "proto3";
package agents.v1;

service AgentService {
  rpc RunTask (TaskRequest) returns (TaskResponse);
  rpc StreamSteps (TaskRequest) returns (stream StepEvent);
  rpc Health (HealthRequest) returns (HealthResponse);
}

message TaskRequest {
  string task_id = 1;
  string session_id = 2;
  string prompt = 3;
  map<string, string> metadata = 4;
  int32 max_steps = 5;
  int32 token_budget = 6;
}

message TaskResponse {
  string task_id = 1;
  string status = 2;
  string output = 3;
  int32 steps_used = 4;
  int32 tokens_used = 5;
  string error = 6;
}

message StepEvent {
  int32 step_number = 1;
  string type = 2; // "llm_call" | "tool_call" | "tool_result"
  string content = 3;
}

Kafka Event Schema

For async pipeline handoffs between agents, use Avro or JSON schemas registered in a Schema Registry.

{
  "schema": {
    "type": "record",
    "name": "AgentTaskEvent",
    "namespace": "com.myco.agents.v1",
    "fields": [
      {"name": "task_id", "type": "string"},
      {"name": "source_agent", "type": "string"},
      {"name": "target_agent", "type": "string"},
      {"name": "session_id", "type": "string"},
      {"name": "prompt", "type": "string"},
      {"name": "context", "type": {"type": "map", "values": "string"}},
      {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
    ]
  }
}

Kafka Producer (in Orchestrator)

# In orchestrator when dispatching to agent-search
import json
import time  # used for the created_at timestamp below

from aiokafka import AIOKafkaProducer

async def dispatch_to_agent(target_agent: str, task: AgentTask):
    producer = AIOKafkaProducer(bootstrap_servers=KAFKA_BROKERS)
    await producer.start()
    try:
        event = {
            "task_id": task.id,
            "source_agent": "orchestrator",
            "target_agent": target_agent,
            "session_id": task.session_id,
            "prompt": task.prompt,
            "created_at": int(time.time() * 1000)
        }
        await producer.send_and_wait(
            topic=f"agent.tasks.{target_agent}",
            value=json.dumps(event).encode(),
            key=task.session_id.encode(), # partition by session
            headers=[("trace-id", get_current_trace_id().encode())]
        )
    finally:
        await producer.stop()

Tool Registry Service

Architecture

The Tool Registry is a centralized FastAPI service that stores, validates, and serves tool definitions. It acts as a typed API gateway for all agent→tool traffic.

Tool Registration Schema

# Tool self-registers on startup
class ToolDefinition(BaseModel):
    name: str
    version: str
    description: str
    parameters: Dict[str, Any] # JSON Schema
    returns: Dict[str, Any] # JSON Schema
    endpoint: str # where registry routes calls
    health_url: str
    auth_type: str # "api_key" | "oauth2" | "none"
    rate_limit: int # calls per minute per agent
    timeout_ms: int = 10000

# Registration call at tool service startup
@app.on_event("startup")
async def register_tool():
    registry = ToolRegistryClient(TOOL_REGISTRY_URL)
    await registry.register(ToolDefinition(
        name="web_search",
        version="2.1.0",
        description="Search the web and return ranked results",
        parameters={
            "type": "object",
            "properties": {
                "query": {"type": "string", "maxLength": 500},
                "num_results": {"type": "integer", "minimum": 1, "maximum": 20}
            },
            "required": ["query"]
        },
        returns={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "url": {"type": "string"},
                    "title": {"type": "string"},
                    "snippet": {"type": "string"}
                }
            }
        },
        endpoint=f"{SERVICE_URL}/invoke",
        health_url=f"{SERVICE_URL}/health",
        auth_type="api_key",
        rate_limit=60,
        timeout_ms=8000
    ))

Registry Validation Layer

# Tool Registry validates before forwarding
from uuid import uuid4

import httpx
import jsonschema

async def invoke_tool(agent_id: str, tool_name: str, params: dict):
    tool = await db.get_tool(tool_name)

    if not tool:
        raise ToolNotFoundError(tool_name)

    # Validate against JSON Schema
    jsonschema.validate(params, tool.parameters) # raises on invalid input

    # Check rate limit
    if not await rate_limiter.check(agent_id, tool_name, tool.rate_limit):
        raise RateLimitExceeded(f"{tool_name} limit: {tool.rate_limit}/min")

    # Forward to tool service with timeout
    async with httpx.AsyncClient(timeout=tool.timeout_ms / 1000) as client:
        response = await client.post(
            tool.endpoint,
            json={"params": params},
            headers={"X-Agent-Id": agent_id, "X-Request-Id": str(uuid4())}
        )
        response.raise_for_status()
        return response.json()
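The `rate_limiter.check(...)` call above is not defined in this guide. One possible shape for it is a fixed-window counter keyed by agent and tool, sketched below. `FixedWindowRateLimiter` is a hypothetical in-memory version; a shared registry would more likely keep the counters in Redis, and this sketch never evicts old windows.

```python
import asyncio
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Per-(agent, tool) counter that starts over every 60-second window."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock  # injectable so the example can control time
        self._counts: Dict[Tuple[str, str, int], int] = defaultdict(int)

    async def check(self, agent_id: str, tool_name: str, limit_per_min: int) -> bool:
        window = int(self._clock() // 60)
        key = (agent_id, tool_name, window)
        if self._counts[key] >= limit_per_min:
            return False  # over the limit for this window
        self._counts[key] += 1
        return True


async def demo() -> list:
    now = [0.0]
    limiter = FixedWindowRateLimiter(clock=lambda: now[0])
    results = [await limiter.check("agent-search", "web_search", 2) for _ in range(3)]
    now[0] = 61.0  # next window: the counter starts over
    results.append(await limiter.check("agent-search", "web_search", 2))
    return results


outcome = asyncio.run(demo())
print(outcome)  # [True, True, False, True]
```

Fixed windows allow a short burst at a window boundary; a sliding window or token bucket smooths that out at the cost of more state.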

Memory Architecture

Memory Tier Selection

ContextManager Implementation

# agent/memory.py
import json
from redis.asyncio import Redis
from qdrant_client import AsyncQdrantClient  # async client: the calls below are awaited
from typing import List

class ContextManager:
    def __init__(self, agent_id: str, max_tokens: int = 4096):
        self.agent_id = agent_id
        self.max_tokens = max_tokens
        self.redis = Redis.from_url(REDIS_URL)
        self.qdrant = AsyncQdrantClient(QDRANT_URL)
        self.embedder = EmbeddingClient()

    async def load(self, session_id: str) -> List[dict]:
        # 1. Load recent turns from Redis
        raw = await self.redis.get(f"session:{session_id}:messages")
        messages = json.loads(raw) if raw else []

        # 2. Retrieve semantically relevant past context
        # (default of None avoids StopIteration when history has no user turn)
        last_user_msg = next(
            (m for m in reversed(messages) if m["role"] == "user"), None
        )
        if last_user_msg is not None:
            embedding = await self.embedder.embed(last_user_msg["content"])
            relevant = await self.qdrant.search(
                collection_name=f"agent_{self.agent_id}_memory",
                query_vector=embedding,
                limit=3
            )
            # Prepend as system context
            for hit in relevant:
                messages.insert(0, {
                    "role": "system",
                    "content": f"[Past context] {hit.payload['summary']}"
                })

        # 3. Compress if over token limit
        return await self._compress_if_needed(messages)

    async def save(self, session_id: str, messages: List[dict]):
        # Save last 20 turns to Redis
        recent = messages[-20:]
        await self.redis.setex(
            f"session:{session_id}:messages",
            86400, # 24h TTL
            json.dumps(recent)
        )

        # If session is long, generate and store a summary in vector DB
        if len(messages) > 30:
            summary = await self._summarize(messages)
            embedding = await self.embedder.embed(summary)
            await self.qdrant.upsert(
                collection_name=f"agent_{self.agent_id}_memory",
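                # Qdrant point IDs must be unsigned integers or UUIDs, so
                # session_id is assumed to be a UUID string here.
                points=[{
                    "id": session_id,
                    "vector": embedding,
                    "payload": {"summary": summary, "session_id": session_id}
                }]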
            )

    async def _compress_if_needed(self, messages: List[dict]) -> List[dict]:
        token_count = estimate_tokens(messages)
        if token_count <= self.max_tokens:
            return messages

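        # Keep system messages + last N non-system turns.
        # Filtering first avoids duplicating system messages that also
        # fall inside the most recent slice.
        system_msgs = [m for m in messages if m["role"] == "system"]
        dialogue = [m for m in messages if m["role"] != "system"]
        return system_msgs + dialogue[-12:]  # last 6 exchanges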

Context Window Management

Token Estimation

import json  # needed for json.dumps on tool calls below

import tiktoken

def estimate_tokens(messages: list, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4 # per-message overhead
        total += len(enc.encode(msg.get("content", "") or ""))
        if "tool_calls" in msg:
            for tc in msg["tool_calls"]:
                total += len(enc.encode(json.dumps(tc)))
    return total

class TokenBudget:
    def __init__(self, total: int, model: str):
        self.total = total
        self.model = model
        self.used = 0
        self.reserved = 1024 # always reserve for output

    @property
    def available_for_input(self):
        return self.total - self.reserved - self.used

    def consume(self, tokens: int):
        self.used += tokens
        if self.used > self.total - self.reserved:
            raise TokenBudgetExceeded(tokens_used=self.used)

Orchestrator & Supervisor Pattern

Orchestrator: Task Decomposition

The Orchestrator is itself an agent microservice, but its role is planning and coordination rather than execution.

# orchestrator/core.py
class OrchestratorAgent:
    async def execute(self, user_request: str, session_id: str) -> str:
        # Step 1: Decompose into a DAG of sub-tasks
        plan = await self.planner.decompose(user_request)
        # Returns: [{"id": "t1", "agent": "search", "task": "...", "deps": []},
        # {"id": "t2", "agent": "summarize", "task": "...", "deps": ["t1"]},
        # {"id": "t3", "agent": "email", "task": "...", "deps": ["t2"]}]

        # Step 2: Execute in topological order, parallel where possible
        results = {}
        for wave in topological_waves(plan):
            # All tasks in a wave have their deps satisfied
            wave_results = await asyncio.gather(*[
                self.supervisor.dispatch(step, results)
                for step in wave
            ])
            for step, result in zip(wave, wave_results):
                results[step["id"]] = result

        # Step 3: Synthesize final output
        return await self.synthesizer.merge(results, user_request)

def topological_waves(plan: list) -> list:
    """Return plan steps grouped into parallel execution waves."""
    completed = set()
    waves = []
    remaining = list(plan)
    while remaining:
        wave = [s for s in remaining if all(d in completed for d in s["deps"])]
        if not wave:
            # A cycle or a dependency on an unknown step would otherwise loop forever
            raise ValueError(f"Unresolvable dependencies: {[s['id'] for s in remaining]}")
        waves.append(wave)
        completed.update(s["id"] for s in wave)
        remaining = [s for s in remaining if s["id"] not in completed]
    return waves
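To make the wave grouping concrete, here is a standalone worked example. `waves_with_cycle_guard` is a hypothetical variant written for this illustration: it returns step IDs only and raises when a wave comes back empty, which is what a plan with a cycle or a dangling dependency produces.

```python
from typing import Dict, List


def waves_with_cycle_guard(plan: List[Dict]) -> List[List[str]]:
    """Group step IDs into waves; raise if no further progress is possible."""
    completed: set = set()
    waves: List[List[str]] = []
    remaining = list(plan)
    while remaining:
        wave = [s for s in remaining if all(d in completed for d in s["deps"])]
        if not wave:
            stuck = [s["id"] for s in remaining]
            raise ValueError(f"cycle or unknown dependency among: {stuck}")
        waves.append([s["id"] for s in wave])
        completed.update(s["id"] for s in wave)
        remaining = [s for s in remaining if s["id"] not in completed]
    return waves


plan = [
    {"id": "t1", "agent": "search", "deps": []},
    {"id": "t2", "agent": "search", "deps": []},
    {"id": "t3", "agent": "summarize", "deps": ["t1", "t2"]},
    {"id": "t4", "agent": "email", "deps": ["t3"]},
]
print(waves_with_cycle_guard(plan))  # [['t1', 't2'], ['t3'], ['t4']]
```

The two searches run in parallel in the first wave; the summarizer waits for both, and the email step waits for the summary.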

Supervisor: Retry & Escalation

class Supervisor:
    def __init__(self, agent_clients: dict):
        self.agent_clients = agent_clients

    async def dispatch(self, step: dict, context: dict) -> StepResult:
        task_prompt = self._inject_context(step["task"], context, step["deps"])

        max_attempts = 3
        for attempt in range(max_attempts):
            last_attempt = attempt == max_attempts - 1
            try:
                return await asyncio.wait_for(
                    self.agent_clients[step["agent"]].run(task_prompt),
                    timeout=60.0
                )
            except asyncio.TimeoutError:
                if last_attempt:
                    raise SupervisorEscalation(step, "timeout_after_3_attempts")
            except AgentError as e:
                # Escalate once retries run out instead of falling through and returning None
                if e.is_unrecoverable or last_attempt:
                    raise SupervisorEscalation(step, str(e))
            await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s

    def _inject_context(self, task: str, results: dict, dep_ids: list) -> str:
        context_parts = [results[dep_id].output for dep_id in dep_ids if dep_id in results]
        if context_parts:
            return f"Context from previous steps:\n{chr(10).join(context_parts)}\n\nTask: {task}"
        return task

Security & Authorization

Agent Identity & JWT Verification

Each agent service must verify that incoming requests are from authorized callers. Use short-lived JWT tokens signed by an internal auth service.

# api/middleware.py
from fastapi import Request, HTTPException
from jose import jwt, JWTError

ALLOWED_CALLERS = {"orchestrator", "supervisor", "api-gateway"}

async def verify_agent_token(request: Request):
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    if not token:
        raise HTTPException(status_code=401, detail="Missing auth token")
    try:
        payload = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
        caller = payload.get("sub")
        if caller not in ALLOWED_CALLERS:
            raise HTTPException(status_code=403, detail=f"Caller {caller} not authorized")
        request.state.caller = caller
    except JWTError as e:
        raise HTTPException(status_code=401, detail=f"Invalid token: {e}")

Secrets Management

Never store API keys in environment literals or ConfigMaps. Use Kubernetes Secrets mounted as environment variables, or preferably HashiCorp Vault with the Vault Agent Sidecar.

# k8s/deployment.yaml (secrets section)
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: agent-secrets
        key: openai-api-key
  - name: TOOL_REGISTRY_TOKEN
    valueFrom:
      secretKeyRef:
        name: agent-secrets
        key: tool-registry-token

Tool Call Authorization

The Tool Registry enforces agent-level RBAC: which agents can invoke which tools.

# Tool Registry ACL check
TOOL_ACL = {
    "agent-search": ["web_search", "vector_search", "knowledge_base"],
    "agent-email": ["send_email", "get_email_thread"],
    "agent-code": ["code_exec", "git_read", "package_search"],
    "agent-data": ["sql_query", "csv_read", "chart_generate"],
}

async def check_tool_acl(agent_id: str, tool_name: str):
    allowed_tools = TOOL_ACL.get(agent_id, [])
    if tool_name not in allowed_tools:
        raise PermissionError(f"{agent_id} is not authorized to call {tool_name}")

Observability: Traces, Logs, Metrics

Distributed Tracing Setup (OpenTelemetry)

# observability/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

def setup_tracing(service_name: str):
    provider = TracerProvider(
        resource=Resource(attributes={SERVICE_NAME: service_name})
    )
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=OTEL_ENDPOINT))
    )
    trace.set_tracer_provider(provider)

    # Auto-instrument frameworks
    FastAPIInstrumentor().instrument()
    RedisInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

Standard Span Attributes for Agent Calls

Always set these attributes on every agent and LLM span:

# In AgentRunner._run_loop:
span.set_attribute("agent.id", self.agent_id)
span.set_attribute("agent.task_id", task.id)
span.set_attribute("agent.session_id", task.session_id)
span.set_attribute("agent.step", step)
span.set_attribute("llm.model", config.model)
span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)
span.set_attribute("llm.finish_reason", response.finish_reason)

# In Tool Registry on invoke:
span.set_attribute("tool.name", tool_name)
span.set_attribute("tool.version", tool.version)
span.set_attribute("tool.caller_agent", agent_id)
span.set_attribute("tool.latency_ms", latency_ms)

Prometheus Metrics

# observability/metrics.py
from prometheus_client import Counter, Histogram, Gauge

agent_tasks_total = Counter(
    "agent_tasks_total",
    "Total tasks processed",
    ["agent_id", "status"]
)

agent_task_duration = Histogram(
    "agent_task_duration_seconds",
    "Task end-to-end latency",
    ["agent_id"],
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120]
)

agent_llm_tokens = Counter(
    "agent_llm_tokens_total",
    "LLM tokens consumed",
    ["agent_id", "token_type"] # token_type: prompt | completion
)

agent_tool_calls = Counter(
    "agent_tool_calls_total",
    "Tool invocations",
    ["agent_id", "tool_name", "status"]
)

agent_steps_per_task = Histogram(
    "agent_steps_per_task",
    "Number of steps per task (runaway guard)",
    ["agent_id"],
    buckets=[1, 2, 3, 5, 8, 10, 15, 20, 25]
)

orchestrator_queue_depth = Gauge(
    "orchestrator_queue_depth",
    "Pending tasks in orchestrator queue"
)

Alert Rules

# alerting/rules.yaml
groups:
  - name: agent-alerts
    rules:
      - alert: AgentHighErrorRate
        expr: |
          sum by (agent_id) (rate(agent_tasks_total{status="failed"}[5m]))
            / sum by (agent_id) (rate(agent_tasks_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "{{ $labels.agent_id }} failure rate above 5%"

      - alert: AgentRunawayTask
        expr: histogram_quantile(0.99, sum by (le, agent_id) (rate(agent_steps_per_task_bucket[5m]))) > 15
        for: 5m
        annotations:
          summary: "Agent tasks exceeding 15 steps — possible runaway loop"

      - alert: LLMTokenCostSpike
        expr: rate(agent_llm_tokens_total[10m]) > 50000
        for: 5m
        annotations:
          summary: "Token consumption rate spike — check for loops"

      - alert: AgentLatencyHigh
        expr: histogram_quantile(0.99, sum by (le, agent_id) (rate(agent_task_duration_seconds_bucket[5m]))) > 10
        for: 5m
        annotations:
          summary: "p99 task latency above 10s"

Structured Logging

# Never log raw prompts or PII. Log task IDs and outcome codes.
import structlog

log = structlog.get_logger()

log.info("agent.task.completed",
    task_id=task.id,
    session_id=task.session_id, # hashed in prod
    agent_id=self.agent_id,
    steps=result.steps_used,
    tokens=result.tokens_used,
    duration_ms=result.duration_ms,
    tool_calls=result.tool_calls,
    status=result.status,
    trace_id=get_current_trace_id()
)

Deployment on Kubernetes

Deployment Manifest

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-search
  labels:
    app: agent-search
    version: v1.4.2
    team: ai-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: agent-search
  template:
    metadata:
      labels:
        app: agent-search
        version: v1.4.2
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: agent-search
      containers:
        - name: agent
          image: registry.myco.io/agent-search@sha256:<digest> # Always pin by digest
          ports:
            - containerPort: 8080 # HTTP API
              name: http
            - containerPort: 50051 # gRPC
              name: grpc
          env:
            - name: AGENT_ID
              value: "agent-search"
            - name: TOOL_REGISTRY_URL
              valueFrom: {configMapKeyRef: {name: agent-config, key: tool-registry-url}}
            - name: REDIS_URL
              valueFrom: {secretKeyRef: {name: agent-secrets, key: redis-url}}
            - name: OPENAI_API_KEY
              valueFrom: {secretKeyRef: {name: agent-secrets, key: openai-api-key}}
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              valueFrom: {configMapKeyRef: {name: observability-config, key: otel-endpoint}}
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 2
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"] # drain connections before shutdown
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: {app: agent-search}

Horizontal Pod Autoscaler (Custom Metrics)

Scale on Kafka consumer lag in addition to CPU, rather than on CPU alone. (A p99 task latency metric could be added as a further entry in the same way.)

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-search-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-search
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # be conservative scaling down
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_group_lag
          selector:
            matchLabels:
              topic: agent.tasks.search
        target:
          type: AverageValue
          averageValue: "100"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
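The arithmetic behind the lag target can be approximated as follows. For an External metric with an `AverageValue` target, the HPA in effect divides the total metric by the target per replica and rounds up; the sketch then applies the `scaleUp` policy of four pods per period and the min/max bounds from the manifest. It is a simplification under stated assumptions: it ignores the tolerance band, the stabilization windows, and the CPU metric (the HPA takes the highest recommendation across metrics). `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(total_lag: int, target_avg_lag: int, current: int,
                     min_replicas: int = 2, max_replicas: int = 20,
                     max_added_per_period: int = 4) -> int:
    """Approximate the HPA decision for a lag metric with an AverageValue target."""
    raw = math.ceil(total_lag / target_avg_lag)        # replicas so lag per replica <= target
    capped = min(raw, current + max_added_per_period)  # scaleUp policy: +4 pods per 60s
    return max(min_replicas, min(max_replicas, capped))


print(desired_replicas(total_lag=1500, target_avg_lag=100, current=2))   # 6
print(desired_replicas(total_lag=1500, target_avg_lag=100, current=12))  # 15
print(desired_replicas(total_lag=50, target_avg_lag=100, current=2))     # 2
```

With 1,500 messages of lag and 2 replicas, the target calls for 15 replicas, but the scale-up policy admits only 4 more per minute, so the deployment reaches 15 over several periods.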

PodDisruptionBudget

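Keep at least one replica available during voluntary disruptions such as node drains and cluster upgrades. (Rolling updates of the Deployment itself are governed by its maxUnavailable setting, not by the PDB.)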

# k8s/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: agent-search-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: agent-search

Scaling Strategies

Per-Agent Scaling Logic

Multi-Model Fallback

If the primary LLM is unavailable or rate-limited, automatically route to a fallback:

class LLMClient:
    MODEL_CASCADE = [
        "gpt-4o", # primary
        "gpt-4o-mini", # cheaper fallback
        "claude-sonnet-4-6", # cross-vendor fallback
    ]

    async def complete(self, messages: list, **kwargs) -> LLMResponse:
        for model in self.MODEL_CASCADE:
            try:
                return await self._call_model(model, messages, **kwargs)
            except (RateLimitError, ModelUnavailable):
                log.warning("llm.fallback", from_model=model, reason="rate_limit_or_unavailable")
                continue
        raise AllModelsUnavailable()
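A runnable miniature of the same cascade, with a fake backend in place of real LLM calls. The model names, `fake_call`, and `complete_with_fallback` are assumptions made for this example; it also returns which model served the request, which is useful for cost attribution.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class RateLimitError(Exception):
    pass


class AllModelsUnavailable(Exception):
    pass


async def complete_with_fallback(
    cascade: List[str],
    call_model: Callable[[str, str], Awaitable[str]],
    prompt: str,
) -> Dict[str, str]:
    """Try each model in order; return the first success and the model that served it."""
    for model in cascade:
        try:
            return {"model": model, "output": await call_model(model, prompt)}
        except RateLimitError:
            continue  # fall through to the next model in the cascade
    raise AllModelsUnavailable(f"tried: {cascade}")


async def fake_call(model: str, prompt: str) -> str:
    # Hypothetical backend: the primary is rate-limited, the others respond.
    if model == "primary":
        raise RateLimitError(model)
    return f"{model} answered: {prompt}"


result = asyncio.run(
    complete_with_fallback(["primary", "cheap", "other-vendor"], fake_call, "hi")
)
print(result["model"])  # cheap
```

Only errors that signal unavailability trigger the fallback; anything else (a malformed request, for instance) should surface immediately rather than be retried on a different model.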

Fault Tolerance & Retry Strategies

Circuit Breaker on LLM Client

from circuitbreaker import circuit

class LLMClientWithCircuitBreaker:
    @circuit(failure_threshold=5, recovery_timeout=30, expected_exception=LLMError)
    async def complete(self, messages: list, **kwargs) -> LLMResponse:
        return await self._raw_complete(messages, **kwargs)

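The circuit opens after 5 consecutive failures and stays open for 30 seconds. While it is open, calls fail fast with a CircuitBreakerError instead of reaching the LLM, so the caller can catch that error to serve a fallback response or route to a secondary model.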
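The closed, open, and half-open states can be seen in a hand-rolled sketch. This is not how the `circuitbreaker` package is implemented; `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the clock is injectable only so the example can jump forward in time.

```python
import time
from typing import Callable, List, Optional


class CircuitOpenError(Exception):
    pass


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str]) -> str:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast")  # open: skip the backend
            # Half-open: the timeout has elapsed, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result


def failing() -> str:
    raise TimeoutError("LLM timed out")


now = [0.0]
breaker = SimpleCircuitBreaker(failure_threshold=2, recovery_timeout=30, clock=lambda: now[0])
events: List[str] = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        events.append("open")
    except TimeoutError:
        events.append("failed")

now[0] = 31.0  # recovery timeout has passed
events.append(breaker.call(lambda: "ok"))  # half-open trial succeeds
print(events)  # ['failed', 'failed', 'open', 'ok']
```

The third call never reaches the backend, which is the point: a struggling LLM provider is given room to recover instead of being hit with retries.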

Exponential Backoff with Jitter

from tenacity import (
    retry, stop_after_attempt,
    wait_exponential_jitter, retry_if_exception_type
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential_jitter(initial=1, max=60),
    retry=retry_if_exception_type((RateLimitError, TimeoutError, ServiceUnavailable))
)
async def call_tool_with_retry(tool_name: str, params: dict):
    return await tool_registry.invoke(tool_name, params)
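To see what delays this kind of policy produces, the sketch below computes them directly: an exponential term plus uniform jitter, capped at a maximum. It is an approximation for illustration, with `backoff_delays` as a hypothetical helper, and it does not call tenacity. Note that with `stop_after_attempt(3)` only the first two waits would ever occur.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, initial: float = 1.0, max_delay: float = 60.0,
                   jitter: float = 1.0, rng: Optional[random.Random] = None) -> List[float]:
    """Delay before each retry: min(initial * 2**n + U(0, jitter), max_delay)."""
    rng = rng or random.Random()
    return [
        min(initial * (2 ** n) + rng.uniform(0, jitter), max_delay)
        for n in range(attempts)
    ]


delays = backoff_delays(attempts=8, rng=random.Random(0))
for n, delay in enumerate(delays):
    print(f"retry {n + 1}: wait {delay:.2f}s")
```

The jitter spreads out retries from many workers that failed at the same moment, so they do not all hit the recovering service in the same instant.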

Dead Letter Queue Handler

# dlq_handler.py — consumes from dead-letter topic
class DLQHandler:
    async def process(self, event: AgentTaskEvent):
        log.error("agent.task.dlq",
            task_id=event.task_id,
            target_agent=event.target_agent,
            attempt_count=event.retry_count,
            original_error=event.last_error
        )

        # Alert on-call if error is novel
        if await self.is_novel_error(event.last_error):
            await self.pagerduty.alert(event)

        # Store for human review dashboard
        await self.db.insert_dlq_item(event)

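        # Auto-re-queue with modified params after 1 hour (optional).
        # The delay runs in a background task: awaiting the sleep inline
        # would stall this consumer for the whole hour.
        if event.retry_count < 2 and event.auto_retry_eligible:
            event.retry_count += 1
            requeue = asyncio.create_task(self._requeue_later(event, delay_s=3600))
            # Hold a reference so the task is not garbage-collected
            # (self._requeue_tasks is assumed to be a set created in __init__).
            self._requeue_tasks.add(requeue)
            requeue.add_done_callback(self._requeue_tasks.discard)

    async def _requeue_later(self, event: AgentTaskEvent, delay_s: int):
        await asyncio.sleep(delay_s)
        await self.kafka.send("agent.tasks." + event.target_agent, event)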

Step-Level Checkpointing

class CheckpointedAgentRunner(AgentRunner):
    async def _run_loop(self, task: AgentTask, span) -> AgentResult:
        # Restore from checkpoint if available
        checkpoint = await self.redis.get(f"checkpoint:{task.id}")
        if checkpoint:
            state = json.loads(checkpoint)
            messages = state["messages"]
            total_tokens = state["total_tokens"]
            start_step = state["step"] + 1
            log.info("agent.checkpoint.restored", task_id=task.id, step=start_step)
        else:
            context = await self.memory.load(task.session_id)
            messages = build_messages(context, task.prompt)
            total_tokens = 0
            start_step = 0

        for step in range(start_step, task.max_steps):
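            response = await self._complete_with_retry(messages, tool_schemas)
            messages.append(response.message)
            # Accumulate usage so the checkpoint and the final result carry real token counts
            total_tokens += response.usage.total_tokens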

            # Persist checkpoint after each step
            await self.redis.setex(
                f"checkpoint:{task.id}",
                3600,
                json.dumps({"messages": messages, "total_tokens": total_tokens, "step": step})
            )

            if response.finish_reason == "stop":
                await self.redis.delete(f"checkpoint:{task.id}")
                break

        return build_result(task, response, total_tokens, step)

Testing Agent Microservices

Testing Pyramid

# tests/unit/test_agent_runner.py
import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    llm = AsyncMock()
    llm.complete.return_value = LLMResponse(
        content="Here is the search result.",
        finish_reason="stop",
        usage=Usage(prompt_tokens=100, completion_tokens=50, total_tokens=150)
    )
    return llm

@pytest.mark.asyncio  # assumes the pytest-asyncio plugin; redundant if asyncio_mode = "auto"
async def test_agent_completes_in_one_step(mock_llm):
    runner = AgentRunner("agent-search", test_config)
    runner.llm = mock_llm

    result = await runner.run(AgentTask(id="t1", session_id="s1", prompt="find AI news"))

    assert result.status == TaskStatus.COMPLETED
    assert result.steps_used == 1
    assert result.tokens_used == 150
    mock_llm.complete.assert_called_once()

@pytest.mark.asyncio  # assumes the pytest-asyncio plugin; redundant if asyncio_mode = "auto"
async def test_agent_respects_token_budget(mock_llm):
    mock_llm.complete.return_value = LLMResponse(
        content="...", finish_reason="tool_calls",
        usage=Usage(prompt_tokens=900, completion_tokens=100, total_tokens=1000)
    )
    task = AgentTask(id="t1", session_id="s1", prompt="...", token_budget=500)
    runner = AgentRunner("agent-search", test_config)
    runner.llm = mock_llm

    result = await runner.run(task)
    assert result.error == "token_budget_exceeded"

Integration Testing with a Mock LLM Server

Use a local mock LLM server (e.g., WireMock or a FastAPI stub) that returns deterministic responses, so tool call flows can be tested end-to-end without hitting real APIs.

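# tests/integration/test_tool_flow.py
import pytest

@pytest.mark.asyncio  # assumes the pytest-asyncio plugin; redundant if asyncio_mode = "auto"
async def test_search_agent_calls_web_search_tool(mock_llm_server, real_redis, real_tool_registry):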
    # Configure mock LLM to respond with a tool call on first turn
    mock_llm_server.set_response(step=0, response=TOOL_CALL_RESPONSE)
    mock_llm_server.set_response(step=1, response=FINAL_RESPONSE)

    runner = AgentRunner("agent-search", integration_config)
    result = await runner.run(AgentTask(id="t1", session_id="s1", prompt="Search for AI news"))

    assert result.status == TaskStatus.COMPLETED
    assert result.tool_calls == 1
    assert real_tool_registry.was_invoked("web_search")

Chaos Testing

Use Chaos Mesh or Litmus to test resilience:

Pod kill: Kill a random agent pod — verify Supervisor retries succeed
Network partition: Block agent→tool-registry traffic — verify circuit breaker opens
LLM latency injection: Add 15s delay to LLM calls — verify timeout and fallback activate
Kafka partition leader election: Simulate Kafka failover — verify no task loss via consumer offset management

CI/CD Pipeline for Agent Services

# .github/workflows/agent-service.yml
name: Agent Service CI/CD

on:
  push:
    paths: ["agents/agent-search/**"]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install -e ".[dev]"
      - run: pytest tests/ --cov=agent --cov-fail-under=85

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with: {image-ref: "agent-search:${{ github.sha }}", exit-code: "1"}

  build-push:
    needs: [test, security-scan]
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and push (pinned by digest)
        id: build
        run: |
          docker buildx build --platform linux/amd64,linux/arm64 \
            -t registry.myco.io/agent-search:${{ github.sha }} \
            --metadata-file build-metadata.json \
            --push agents/agent-search/
          # A multi-platform push is not loaded into the local image store,
          # so read the digest from the build metadata instead of `docker inspect`
          DIGEST=$(jq -r '."containerimage.digest"' build-metadata.json)
          # Job outputs (unlike $GITHUB_ENV) are visible to downstream jobs
          echo "digest=$DIGEST" >> "$GITHUB_OUTPUT"

  deploy-staging:
    needs: build-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/agent-search \
            agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.digest }} \
            -n staging
          kubectl rollout status deployment/agent-search -n staging --timeout=120s

  smoke-test-staging:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python tests/smoke/run_smoke_tests.py --env staging

  deploy-production:
    # build-push is listed so its digest output is available through `needs`
    needs: [build-push, smoke-test-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Rolling deploy to production
        run: |
          kubectl set image deployment/agent-search \
            agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.digest }} \
            -n production
          kubectl rollout status deployment/agent-search -n production --timeout=300s

Cost Management & Token Budgeting

Per-Agent Token Accounting

Track token usage per agent, per session, and per user to enable chargebacks and anomaly detection.

class TokenAccountant:
    async def record(self, agent_id: str, session_id: str, usage: Usage):
        # Increment per-agent daily counter
        await self.redis.incrby(f"tokens:{agent_id}:{today()}", usage.total_tokens)
        await self.redis.expire(f"tokens:{agent_id}:{today()}", 86400 * 7)

        # Increment per-session counter (for user billing)
        await self.redis.incrby(f"tokens:session:{session_id}", usage.total_tokens)

        # Write to time-series DB for cost dashboards
        await self.influx.write(
            measurement="llm_tokens",
            tags={"agent_id": agent_id, "model": usage.model},
            fields={"prompt": usage.prompt_tokens, "completion": usage.completion_tokens},
        )

async def get_estimated_cost(agent_id: str) -> float:
    tokens = int(await redis.get(f"tokens:{agent_id}:{today()}") or 0)
    # GPT-4o pricing: $2.50/1M prompt, $10/1M completion (example)
    return (tokens / 1_000_000) * 5.0 # blended estimate

Budget Enforcement at Session Level

MAX_SESSION_TOKENS = 50_000 # hard cap per user session

async def check_session_budget(session_id: str):
    used = int(await redis.get(f"tokens:session:{session_id}") or 0)
    if used > MAX_SESSION_TOKENS:
        raise SessionBudgetExceeded(
            session_id=session_id,
            tokens_used=used,
            limit=MAX_SESSION_TOKENS
        )

Production Readiness Checklist

Service-Level Requirements

Agent has /health endpoint that checks LLM client connectivity
Agent has /ready endpoint that checks memory store (Redis) and Tool Registry reachability
All tool calls are schema-validated by Tool Registry before execution
Agent-level RBAC enforced: agent X cannot invoke tools it is not authorized for
JWT verification on all inter-agent gRPC and HTTP calls
Secrets loaded from Kubernetes Secrets or Vault — never from env literals or ConfigMaps

Reliability Requirements

Context window size is bounded — no unbounded message history growth
Token budget enforced per task with hard ceiling
MAX_STEPS guard in place to prevent runaway loops
Exponential backoff with jitter on all LLM calls
Circuit breaker configured on LLM client (threshold, recovery timeout)
Exponential backoff on all tool calls
Failed tasks routed to Dead Letter Queue — not silently dropped
Step-level checkpointing for tasks expected to exceed 60 seconds
Multi-model fallback cascade configured (primary → cheaper → cross-vendor)

Observability Requirements

OpenTelemetry distributed tracing with trace context propagation
All LLM completions traced with token counts and latency
All tool calls traced with tool name, version, and outcome
Prometheus metrics exported: task count, duration, token usage, tool calls, step count
Alerts configured: high error rate, runaway steps, token cost spike, high latency
Structured logging (JSON) with task_id, session_id (hashed), trace_id — no raw prompt content

Deployment Requirements

Agent image pinned to digest, not mutable tag (never :latest)
HPA configured with appropriate metrics (queue lag and latency, not just CPU)
PodDisruptionBudget set (minAvailable >= 1)
Pod topology spread constraints configured for HA across nodes
Resource requests and limits set (no QoS class “BestEffort”)
Rolling update strategy with preStop sleep for graceful shutdown
Integration tests cover “tool call fails → agent recovers” path
Load tests simulate 10× expected peak concurrency before go-live

Cost Control Requirements

Token usage recorded per agent and per session
Session-level budget cap enforced
Token cost alerting configured per agent
DLQ monitored — no silent retry storms

Reference Architecture Diagram

What Happens Inside an LLM During Inference: Tokens, KV Cache, and GPU Execution Explained

Kotcherla Murali Krishna — Sat, 23 May 2026 14:37:31 +0000

You type a prompt. You hit Enter. In under two seconds, a response starts streaming back — word by word, almost like a human typing in real time.

But what actually happens between your keypress and that first token appearing on screen?

Inside the server, a sequence of events unfolds that involves tokenization, billions of matrix multiplications, carefully scheduled GPU kernel launches, a memory system called the KV cache, and a probabilistic sampling process, all completing within milliseconds per token.

This article takes you inside the machine. We’ll trace a single inference request from raw text to streamed response, layer by layer, operation by operation. By the end, you’ll understand not just what LLMs do, but how modern systems execute them at scale — and why it’s so expensive.

Tokenization Explained

Before any neural network sees your input, it has to become numbers. But it doesn’t convert character by character — it converts subword chunks called tokens.

Modern LLMs use Byte Pair Encoding (BPE) or variants like SentencePiece. The vocabulary (typically 32K–128K tokens) is learned during training by iteratively merging frequent byte pairs until a fixed vocab size is reached.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "What happens inside an LLM during inference?"
tokens = tokenizer(text)
print(tokens["input_ids"])
# [128000, 3923, 8741, 4871, 459, 445, 11237, 2391, 45478, 30]
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
# ['<|begin_of_text|>', 'What', ' happens', ' inside', ' an', ' L', 'LM', ' during', ' inference', '?']

💡 Key insight: “LLM” becomes two tokens: L and LM. Tokenization is not intuitive — it's frequency-driven. Common words are single tokens; rare or technical terms split.

The output is a sequence of integer IDs — the true input to the model. Token count determines compute cost directly: more tokens = more compute.
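The merge loop described above can be sketched in a few lines. This is a toy, character-level illustration rather than the byte-level tokenizer that production models ship with; the function names (`train_bpe`, `most_frequent_pair`, `merge_pair`) and the tiny corpus are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the pair that was seen first
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, segmented = train_bpe("low low low lower lowest", num_merges=2)
print(merges)     # learned merge rules, most frequent first
print(segmented)  # how each word is split after the merges
```

After two merges the frequent stem "low" has become a single symbol while the rarer suffixes remain split into characters, which mirrors why common words map to one token and rare terms map to several.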

Embeddings and Vector Representations

Token IDs are integers. Neural networks need continuous-valued vectors. The embedding layer is a giant lookup table: a matrix of shape [vocab_size × d_model] where d_model is typically 4096 (for 7B-class models) or 8192 (for 70B+).

import torch
import torch.nn as nn

vocab_size = 128_000
d_model = 4096

embedding = nn.Embedding(vocab_size, d_model)

# Token IDs → embedding vectors
token_ids = torch.tensor([3923, 8741, 4871, 459])
vectors = embedding(token_ids) # shape: [4, 4096]

Each token becomes a 4096-dimensional vector floating in a high-dimensional space. Semantically similar tokens cluster together — “king” and “queen” are nearby; “inference” and “forward pass” are closer than “inference” and “potato.”

Position information is then injected to encode sequence order, since self-attention on its own is permutation-equivariant and has no notion of token order:

# Rotary Positional Embedding (RoPE) — used in LLaMA, Mistral, Gemma
# Encodes position by rotating query/key vectors in frequency space
# No separate positional embedding matrix needed

🔑 Modern LLMs use RoPE (Rotary Position Embedding) instead of the classic sinusoidal embeddings from the original “Attention is All You Need” paper. RoPE encodes relative position directly into the attention computation, allowing better generalization to longer sequences.
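To make the "relative position" claim concrete, here is a minimal sketch of the rotation in plain Python. It pairs adjacent dimensions as in the original RoFormer formulation (some implementations pair dimension i with i + d/2 instead); `rope_rotate`, the base of 10000, and the example vectors are assumptions for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (x0, x1), (x2, x3), ... by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two very different absolute positions
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error: the attention score depends only on how far apart the query and key are, not on where they sit in the sequence.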

Transformer Layers Explained Visually

A transformer is a stack of identical layers — GPT-3 has 96, LLaMA-3 8B has 32, LLaMA-3 70B has 80. Each layer has two main sublayers:

Multi-Head Self-Attention (MHSA)
Feed-Forward Network (FFN)

With residual connections and layer normalization (typically pre-norm in modern models) wrapping each.

💡 Why residual connections? They allow gradients to flow directly during training, preventing vanishing gradients. During inference, they mean each layer refines the representation rather than replacing it — the model builds understanding incrementally.

Modern LLMs replace standard LayerNorm with RMSNorm (root mean square normalization) — computationally cheaper, empirically equivalent. The FFN uses SwiGLU activation instead of classic ReLU, adding a gating mechanism that improves expressivity.
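The difference between the two normalizations is easy to see side by side. The sketch below uses plain Python lists; `rms_norm`, `layer_norm`, and the `eps` default are illustrative names and values, not taken from any particular model's code.

```python
import math

def layer_norm(x, weight, bias, eps=1e-6):
    """Classic LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [w * (v - mean) / math.sqrt(var + eps) + b
            for v, w, b in zip(x, weight, bias)]

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square only (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

x = [3.0, 4.0]
ones = [1.0, 1.0]
y = rms_norm(x, ones)
print(y)  # the output has a mean square of roughly 1
```

RMSNorm skips the mean and variance passes and the bias term, which is where the modest compute saving comes from.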

Self-Attention Mechanism

Self-attention is the heart of the transformer. It lets every token look at the other tokens in the sequence (in decoder-only LLMs, only the tokens before it) and decide what to attend to.

For each token, three vectors are computed via learned linear projections:

Q (Query) — “What am I looking for?”
K (Key) — “What do I contain?”
V (Value) — “What do I output if attended to?”

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)

    # Attention scores: how much each token attends to every other
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax over keys dimension → attention weights
    weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    return torch.matmul(weights, V), weights

# Multi-head: run attention H times in parallel, concat results
# Each head learns different relationship types

Multi-head attention runs this H times in parallel (e.g., 32 heads for a 7B model), each with a smaller dimension d_k = d_model / H. Each head specializes — some learn syntactic relationships, others semantic, others coreference.

⚠️ Complexity: Self-attention is O(n²) in sequence length — doubling the context quadruples the attention computation. This is why long-context models (128K+ tokens) need specialized attention algorithms like FlashAttention.

FlashAttention (Dao et al., 2022) restructures the computation to stay within GPU SRAM rather than repeatedly reading/writing to HBM, achieving 2–4× speedup without changing the math.

Matrix Multiplication on GPUs

Every linear projection in a transformer (Q, K, V projections, FFN weights, output projections) is a matrix multiplication (matmul). For a batch of tokens and a weight matrix:

Output [batch × seq × d_out] = Input [batch × seq × d_in] × W [d_in × d_out]

For LLaMA-3 8B with d_model=4096, this is a 4096×4096 matmul per projection — millions of multiply-accumulate operations.

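GPUs are built for exactly this. An H100 SXM delivers roughly 989 TFLOPS of dense BF16 tensor core throughput (the often-quoted ~1,979 TFLOPS figure assumes 2:4 structured sparsity). The secret: tensor cores, specialized hardware that multiplies small matrix tiles (4×4 in the original Volta design, larger in later generations) in a single operation.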

Memory bandwidth is the bottleneck, not compute. During token generation (decode phase), the GPU loads enormous weight matrices from HBM for each token — but performs relatively few FLOPs per byte loaded. This is called being memory-bound (low arithmetic intensity).

📊 Arithmetic Intensity = FLOPs / bytes accessed

Prefill phase: ~high intensity → compute-bound

Decode phase: ~low intensity → memory-bound

This asymmetry is why prefill and decode need different optimization strategies.
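A back-of-the-envelope calculation shows how large the gap is. The sketch below counts FLOPs and bytes for one 4096×4096 projection, assuming each operand is read once and the output written once. The hardware figures (about 989 TFLOPS dense BF16 and about 3.35 TB/s of HBM bandwidth for an H100 SXM) are assumptions used only to place the roofline ridge point.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for [m x k] @ [k x n] with one read per input and one output write."""
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved

D_MODEL = 4096
PEAK_FLOPS = 989e12      # assumed dense BF16 peak
HBM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth in bytes/second
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH

prefill = matmul_arithmetic_intensity(m=1024, k=D_MODEL, n=D_MODEL)  # 1024 prompt tokens
decode = matmul_arithmetic_intensity(m=1, k=D_MODEL, n=D_MODEL)      # one new token

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print(f"prefill:     {prefill:.0f} FLOPs/byte")
print(f"decode:      {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill lands above the ridge point (compute-bound) while single-token decode sits around one FLOP per byte, hundreds of times below it. Batching decode requests raises `m` and moves decode back toward the ridge.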

CUDA Kernels and Tensor Operations

When PyTorch executes torch.matmul(), it doesn't run one monolithic computation. It dispatches a CUDA kernel, a compiled GPU function that runs in parallel across thousands of threads.

import torch

A = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)

# This dispatches a cuBLAS GEMM kernel under the hood
C = torch.matmul(A, B) # ~137 billion FLOPs in milliseconds

The GPU execution stack for a single matmul:

PyTorch op → Dispatches cuBLAS → Selects optimal GEMM kernel 
→ Divides matrix into tiles → Assigns tiles to Streaming Multiprocessors (SMs)
→ Each SM loads tile to shared memory (SRAM)
→ Tensor cores compute 4×4 fragments
→ Results written back to HBM

Kernel fusion is a critical optimization: instead of launching separate kernels for attention scores, softmax, and matmul, fused kernels like FlashAttention do it all in one pass — dramatically reducing HBM traffic.

# Flash Attention via PyTorch SDPA (scaled dot product attention)
from torch.nn.functional import scaled_dot_product_attention

# Automatically uses FlashAttention backend when available
output = scaled_dot_product_attention(Q, K, V, is_causal=True)

KV Cache During Inference

This is one of the most important engineering decisions in LLM inference.

During generation, the model processes the full prompt once — but for each new token generated, it only needs to run attention for that one new token against all prior tokens. Without caching, you’d recompute K and V vectors for the entire history on every step.

The KV cache stores the Key and Value tensors for all previously processed tokens:

# Simplified KV cache structure
kv_cache = {
    layer_idx: {
        "k": torch.zeros(batch_size, num_heads, max_seq_len, head_dim),
        "v": torch.zeros(batch_size, num_heads, max_seq_len, head_dim),
        "length": 0 # current filled position
    }
    for layer_idx in range(num_layers)
}

def attention_with_cache(Q_new, K_new, V_new, layer_idx, cache):
    # Q_new: [batch, num_heads, 1, head_dim]; K_new, V_new: [batch, num_heads, head_dim]
    pos = cache[layer_idx]["length"]

    # Append new K, V
    cache[layer_idx]["k"][:, :, pos, :] = K_new
    cache[layer_idx]["v"][:, :, pos, :] = V_new
    cache[layer_idx]["length"] += 1

    # Attend over full cached history
    K_full = cache[layer_idx]["k"][:, :, :pos+1, :]
    V_full = cache[layer_idx]["v"][:, :, :pos+1, :]

    return scaled_dot_product_attention(Q_new, K_full, V_full)

💾 Memory cost of KV cache:

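2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For LLaMA-3 70B (BF16, 80 layers, head_dim 128): ~21 GB for a single 8K-token sequence if every one of the 64 attention heads kept its own K/V, or ~2.7 GB with the model's 8 grouped KV heads.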

This is why long-context inference is extremely memory-intensive.

Grouped Query Attention (GQA) — used in LLaMA-3, Mistral, and others — reduces this by sharing K/V heads across groups of Q heads, cutting KV cache size by 4–8×.
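The formula is simple enough to check directly. The sketch below assumes the published LLaMA-3 70B shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and BF16 storage; `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Bytes needed to hold K and V for every layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

SEQ_LEN = 8192
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=SEQ_LEN)
with_gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"64 KV heads (no GQA): {full_mha / 1e9:.1f} GB")
print(f" 8 KV heads (GQA):    {with_gqa / 1e9:.1f} GB")
print(f"reduction:            {full_mha // with_gqa}x")
```

The saving equals the ratio of query heads to KV heads, so it is 8× for this configuration and 4× for a model with 32 query heads and 8 KV heads.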

Prefill vs Decode Phase

Inference has two fundamentally different phases with different compute profiles:

Prefill processes all prompt tokens in one large parallel forward pass. It’s compute-intensive — the GPU’s tensor cores are fully utilized. For a 1000-token prompt, this might take 50–200ms on an H100.

Decode generates tokens one by one. Each decode step:

Runs a forward pass for one new token
Reads the full KV cache (all prior tokens) from HBM
Appends new K/V entries to cache
Samples one token from the output distribution

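Decode is slow because every step must stream the model weights and the KV cache from HBM, so it is limited by memory bandwidth rather than FLOPs. As a rough bound, a 70B model in BF16 (~140 GB of weights) sharded across two H100s (~3.35 TB/s each) cannot exceed about 48 tokens/second for a single sequence. Figures of 100–200 tokens/second come from batching or quantization, and still sit far below the theoretical FLOP peak.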

⚡ Time To First Token (TTFT) is dominated by prefill.

Inter-Token Latency (ITL) is determined by decode throughput.

These are the two key SLAs in production inference systems.
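A rough bandwidth model makes the decode side of this measurable. The sketch below assumes a 70B BF16 model split across two GPUs with about 3.35 TB/s of HBM bandwidth each, an 8K-token context with 8 KV heads, and that every decode step reads all weights once plus each sequence's KV cache. It ignores compute, communication, and kernel overheads, so treat the results as upper bounds.

```python
def decode_tokens_per_second(weight_bytes, kv_bytes_per_seq, batch_size, hbm_bandwidth):
    """Upper bound on aggregate decode throughput when each step is limited by HBM reads."""
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    seconds_per_step = bytes_per_step / hbm_bandwidth  # also the inter-token latency bound
    return batch_size / seconds_per_step

WEIGHT_BYTES = 70e9 * 2                           # 70B parameters in BF16
KV_BYTES_PER_SEQ = 2 * 80 * 8 * 128 * 8192 * 2    # assumed 8K context, 8 KV heads
BANDWIDTH = 2 * 3.35e12                           # assumed: two GPUs, tensor parallel

single = decode_tokens_per_second(WEIGHT_BYTES, KV_BYTES_PER_SEQ, 1, BANDWIDTH)
batched = decode_tokens_per_second(WEIGHT_BYTES, KV_BYTES_PER_SEQ, 8, BANDWIDTH)

print(f"batch 1: {single:.0f} tokens/s")
print(f"batch 8: {batched:.0f} tokens/s aggregate")
```

Because the weights are read once per step regardless of batch size, aggregate throughput grows almost linearly with batch size until the KV cache reads start to dominate, while per-request inter-token latency degrades only slightly.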

Token Sampling Strategies

After the final transformer layer, a linear projection maps d_model → vocab_size, followed by softmax to produce a probability distribution over all tokens. Then sampling:

import torch
import torch.nn.functional as F

def sample_token(logits, temperature=0.8, top_p=0.9, top_k=50):
    # temperature == 0 means greedy decoding; return early to avoid dividing by zero
    if temperature == 0:
        return torch.argmax(logits, dim=-1, keepdim=True)

    # Temperature scaling - higher = more random
    logits = logits / temperature

    # Top-K filtering - only consider top K tokens
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        top_k_values, _ = torch.topk(logits, top_k)
        threshold = top_k_values[..., -1, None]
        logits = logits.masked_fill(logits < threshold, float('-inf'))

    # Top-P (nucleus) sampling - smallest set summing to probability p
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        # Drop a token once the probability mass before it already exceeds top_p
        sorted_indices_to_remove = cumulative_probs - sorted_probs > top_p
        sorted_logits = sorted_logits.masked_fill(sorted_indices_to_remove, float('-inf'))

        # Scatter along the vocab dimension so batched logits work too
        logits = torch.full_like(logits, float('-inf')).scatter(-1, sorted_indices, sorted_logits)

    # Sample from filtered distribution
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

🎲 Practical defaults: Temperature 0.6–0.8 for chat, Top-P 0.9, Top-K 40–100. Coding tasks often use lower temperature (0.2–0.4) for more deterministic outputs. Use temperature=0 for greedy decoding (deterministic reproduction).

Streaming Responses

When ChatGPT “types” a response, it’s not buffering the full answer — it’s streaming each token as it’s generated, using Server-Sent Events (SSE).

# FastAPI streaming endpoint (simplified)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def token_stream(prompt: str):
    async for token in model.generate_stream(prompt):
        # SSE format: "data: {token}\n\n"
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/generate")
async def generate(prompt: str):
    return StreamingResponse(
        token_stream(prompt),
        media_type="text/event-stream"
    )

On the client side:

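// The FastAPI handler above declares `prompt: str`, which FastAPI reads from the query string
const response = await fetch('/generate?prompt=' + encodeURIComponent(prompt), { method: 'POST' });
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A network chunk can end mid-event, so buffer until a full "\n\n"-terminated event arrives
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split('\n\n');
    buffer = events.pop(); // keep the trailing partial event for the next read

    for (const line of events) {
        if (line.startsWith('data: ') && line !== 'data: [DONE]') {
            const token = line.slice(6);
            displayToken(token); // Append to UI
        }
    }
}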

The user experience of “streaming” emerges from the model’s decode loop — each iteration generates one token (or a few), which is immediately sent over the wire. The GPU is generating while the client is rendering.

Inference Engines

Raw PyTorch is not how production LLMs are served. Specialized inference engines add layers of optimization:

vLLM

The most widely deployed open-source inference engine. Key innovations:

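PagedAttention: Inspired by OS virtual memory paging. The KV cache is divided into fixed-size blocks (“pages”) that are allocated on demand. This removes external fragmentation and confines internal fragmentation to the last, partially filled block of each sequence, enabling 2–4× higher throughput than earlier serving systems at similar latency.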
Continuous batching (detailed below)
Tensor parallelism across multiple GPUs

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain LLM inference in detail"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
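The block-table idea behind PagedAttention can be sketched without a GPU. The allocator below is a toy model, not vLLM's implementation: `PagedKVAllocator`, its method names, and the block size of 4 are assumptions chosen to keep the example small. It tracks only which physical block each token lands in, not the tensors themselves.

```python
class PagedKVAllocator:
    """Maps each sequence's logical token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused slots: at most block_size - 1 per live sequence."""
        return sum(len(table) * self.block_size - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())


allocator = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    allocator.append_token("a")  # 5 tokens -> 2 blocks
for _ in range(4):
    allocator.append_token("b")  # 4 tokens -> 1 block

print(allocator.block_tables, allocator.wasted_slots())
```

A contiguous allocator would have to reserve the maximum sequence length per request up front. Here memory is claimed one block at a time, blocks need not be adjacent, and freeing a sequence makes its blocks immediately reusable by any other request.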

TensorRT-LLM

NVIDIA’s production inference engine. Compiles models to optimized TensorRT engines:

Kernel fusion (combines multiple ops into single CUDA kernels)
INT8/FP8 quantization with calibration
In-flight batching
Multi-GPU tensor + pipeline parallelism
Specifically tuned for NVIDIA hardware (A100/H100)

Hugging Face TGI (Text Generation Inference)

The most developer-accessible production server:

Flash Attention and Paged Attention backends
Continuous batching
Token streaming via gRPC and HTTP
Used internally at HuggingFace for the Inference API

# Deploy LLaMA-3 8B with TGI
docker run --gpus all -p 8080:80 \
  -v $PWD/models:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --num-shard 1 \
  --max-input-length 4096 \
  --max-total-tokens 8192

Continuous Batching

Traditional static batching waits for a full batch of requests before starting inference — meaning requests that arrive mid-generation wait for the whole batch to finish. Terrible for latency.

Continuous batching (also called in-flight batching) schedules requests at the iteration level rather than the batch level.

In continuous batching, every decode iteration can add new requests and retire finished ones. GPU utilization stays high regardless of heterogeneous sequence lengths. vLLM, TGI, and TensorRT-LLM all implement this.

📈 Throughput impact: Continuous batching typically achieves 5–10× higher throughput than static batching at the same latency SLA.
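The scheduling difference can be illustrated with a small simulation. This is a simplified model under stated assumptions: all requests are queued at time zero, each needs a known number of decode steps, and every step costs the same regardless of how many slots are busy. The function names and request lengths are made up for the example.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue at every decode iteration."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now and release their slot
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [10, 2, 2, 2, 10, 2, 2, 2]  # decode steps needed per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the short requests no longer wait behind the long ones, so the same work completes in fewer iterations. The size of the gain in practice depends on how varied the output lengths are and on the arrival pattern.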

GPU Memory Bottlenecks

GPU memory is the scarce resource in LLM inference. An H100 SXM5 has 80 GB of HBM3. For a 70B parameter model in BF16:

Model weights: 70B × 2 bytes = 140 GB → needs 2× H100s minimum (tensor parallel)
KV cache (8K ctx): ~2.7 GB per request with 8 grouped KV heads (~21 GB without GQA)
Activations: ~1–2 GB per request
CUDA overhead: ~1–2 GB

Quantization is the primary lever for fitting larger models:

GPTQ (post-training quantization) and AWQ (activation-aware weight quantization) are the dominant 4-bit quantization methods for inference — fitting a 70B model on a single H100 while preserving most quality.
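The effect of bit width on weight memory is straightforward arithmetic. The helper below is a rough sketch: `weight_memory_gb` is an illustrative name, and the estimate ignores the per-group scales and zero-points that 4-bit formats store alongside the weights, which add a small overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70e9
GPU_MEMORY_GB = 80  # one 80 GB accelerator

for bits in (16, 8, 4):
    size = weight_memory_gb(NUM_PARAMS, bits)
    fits = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit: {size:.0f} GB ({fits} on one 80 GB GPU)")
```

At 4 bits the weights take about a quarter of the BF16 footprint, leaving the remaining memory for KV cache and activations.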

Speculative decoding is another technique: a smaller “draft” model generates candidate tokens quickly; the large “verifier” model checks them in parallel. If accepted, you get multiple tokens per large-model forward pass — reducing the memory-bandwidth bottleneck of decode.

Scaling Inference to Millions of Users

Serving a single model to millions of concurrent users requires distributed systems engineering layered on top of GPU optimization:

Key scaling strategies:

Tensor Parallelism (TP): Split each weight matrix across multiple GPUs. Each GPU computes a shard; results are all-reduced via NVLink. Used within a node.

Pipeline Parallelism (PP): Split transformer layers across GPUs — GPU 1 runs layers 1–20, GPU 2 runs layers 21–40, etc. Used across nodes.

Data Parallelism: Run multiple full model replicas, each serving a subset of requests. The simplest form — scale out by adding replicas.

Prompt caching / prefix caching: For requests sharing a long system prompt (e.g., all users of the same chatbot share the same 2000-token system prompt), the KV cache for that prefix is computed once and reused. Anthropic, OpenAI, and Google all offer this as a feature — reducing cost and latency for shared prefixes.
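A minimal sketch of the lookup logic behind prefix caching is shown below. It is a toy under stated assumptions: the `PrefixCache` class, the block size of 4, and the placeholder strings standing in for KV tensors are all hypothetical, and keying on the full prefix tuple is chosen for clarity (real systems typically use chained block hashes).

```python
class PrefixCache:
    """Reuses KV blocks for token prefixes that have been seen before."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # full token prefix up to a block boundary -> placeholder KV

    def prefill(self, tokens):
        """Return (reused_tokens, computed_tokens) for this prompt."""
        reused = 0
        full_len = len(tokens) // self.block_size * self.block_size
        for end in range(self.block_size, full_len + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.blocks and reused == end - self.block_size:
                reused = end  # contiguous hit: this block's KV can be reused as is
            else:
                self.blocks[key] = f"kv[{end - self.block_size}:{end}]"
        return reused, len(tokens) - reused


cache = PrefixCache(block_size=4)
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]  # shared 8-token prefix

first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201])
print(first, second)
```

The first request computes everything and populates the cache. The second reuses both full blocks of the shared system prompt and only pays prefill for its own suffix. Only whole blocks are shared, so a prefix shorter than one block gives no hit.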

KV cache offloading: Move KV cache entries for paused requests from GPU HBM to CPU RAM or NVMe SSD, freeing GPU memory for active requests. Trades latency for capacity.

🏗️ Production deployment stack (typical):

Kubernetes + GPU operator → vLLM / TRT-LLM serving pods → Prometheus + Grafana monitoring → OpenTelemetry tracing → Autoscaling on queue depth

Final Architecture Walkthrough

Let’s trace a single request end-to-end through a production LLM inference system:

Timing breakdown for a 1000-token prompt → 200-token response on H100:

🔮 The big bet: The industry is moving toward inference-time compute as a primary scaling axis — not just training. This means inference systems will bear an increasingly large share of the total AI compute budget, making every optimization described in this article more important, not less.

Closing

When you send a prompt to an LLM, you’re setting off a cascade of engineering decisions made by hundreds of researchers and engineers — from the BPE tokenizer vocabulary to the CUDA kernel that fuses attention, from the paging algorithm managing KV cache memory to the SSE stream delivering tokens to your browser.

The transformer is elegant math. Production inference is brutal systems engineering.

Understanding both levels — the math and the metal — is what separates engineers who use LLMs from engineers who can build, optimize, and scale them.

KV Cache Explained Like You're an LLM Engineer

Kotcherla Murali Krishna — Wed, 20 May 2026 06:20:37 +0000

How transformer inference actually works under the hood — and why KV cache is the single most important optimization keeping your LLM from crawling.

If you've ever wondered why LLMs respond fast even on long prompts — the answer is KV cache. But most explanations stop at "it stores keys and values." This goes deeper.

What You'll Learn
By the end of this article you'll understand:

Why autoregressive LLM generation is expensive by design
What attention actually computes — and why recomputing it is wasteful
The difference between prefill and decode phases
How KV cache grows and when it becomes your GPU's worst enemy
How vLLM's PagedAttention solved memory fragmentation
What's coming next: quantized cache, sliding window, speculative decoding

Introduction: Why LLM Inference is Expensive
Let's start with an uncomfortable truth.
When you send a prompt to GPT-4 or Claude and watch that first token appear, your GPU has just burned through millions of floating-point operations before producing a single character. And then, for every subsequent token in the response — it does it again.
Not a small version. The full computation. Attention over the entire sequence. Every time.

Without optimization, a 7B parameter model generating a 200-token response would recompute attention across the full growing sequence 200 times. For a 70B model on a context of 4,096 tokens, that's not slow — it's practically unusable in production.
This is the core economics problem of LLM inference: autoregressive generation is inherently sequential and expensive. You can't parallelize generation across output tokens the way you parallelize training across a batch. Each new token depends on every token that came before it.

KV cache is the engineering solution that makes modern LLM inference economically viable. It's not magic — it's a deliberate memory-compute tradeoff. Understanding it deeply is the difference between an ML engineer who deploys models and one who optimizes them.

What Happens During Token Generation

Before we talk about caching, let's understand what we're caching from.
LLMs generate text one token at a time, left to right. Each generation step:

Takes the full input prompt + all previously generated tokens
Runs a complete forward pass through the transformer
Produces a probability distribution over the vocabulary
Samples the next token from that distribution

Then the new token is appended to the sequence, and the process repeats.
Here's that loop in code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")

generated = input_ids
for _ in range(50):  # generate 50 tokens
    with torch.no_grad():
        # Full forward pass every single step — expensive!
        outputs = model(generated)
        logits = outputs.logits[:, -1, :]        # last token's logits
        next_token = torch.argmax(logits, dim=-1)
        # unsqueeze(-1) turns [batch] into [batch, 1], which also works for batch sizes above 1
        generated = torch.cat([generated, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(generated[0]))

Notice the problem: by step 50, generated holds the prompt plus 50 new tokens, and we run a full transformer forward pass over all of them just to predict the next one. Per-step cost grows with sequence length, so the total work grows roughly quadratically with the number of generated tokens.

The question is: why do we need to reprocess all previous tokens every time?

Transformer Attention: A Quick Refresher
To understand why KV cache exists, you need to understand what attention actually computes.
The core of transformer inference is multi-head self-attention. For each layer and each token position, attention computes three projections:

Q (Query): "What am I looking for?"
K (Key): "What do I contain?"
V (Value): "What do I actually carry?"

The attention output for position i is:

Attention(Q_i, K, V) = softmax(Q_i · K^T / √d_k) · V

In words: token i asks a question (Q), broadcasts it across all token keys (K) to get attention weights, then uses those weights to take a weighted sum of all values (V).
For a sequence of n tokens, each token attends to all n tokens. This is O(n²) in both time and memory — which is why long contexts hurt so much.

Here's the key insight:

Q changes at every step. But K and V for already-processed tokens do NOT change.
Once a token has been processed by the transformer, its Key and Value projections are fixed. Because of causal masking, they depend only on that token and the tokens before it, never on future tokens.

This is the mathematical justification for KV cache.
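A minimal single-head sketch of that claim in plain Python. The weight matrices `Wq`, `Wk`, `Wv`, the toy dimension, and the random inputs are assumptions for illustration, not values from any real model. It checks that attending over cached K/V gives the same output as recomputing K/V for the whole prefix at every step.

```python
import math
import random

random.seed(0)
DIM = 4  # toy head dimension (assumption)

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over a list of key/value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(DIM)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]

def step_without_cache(prefix):
    # Re-project K and V for every token in the prefix, every step
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values)

k_cache, v_cache = [], []

def step_with_cache(x):
    # Project K and V for the new token only, then reuse everything cached
    k_cache.append(matvec(Wk, x))
    v_cache.append(matvec(Wv, x))
    return attend(matvec(Wq, x), k_cache, v_cache)

max_diff = 0.0
for t in range(1, len(tokens) + 1):
    full = step_without_cache(tokens[:t])
    cached = step_with_cache(tokens[t - 1])
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(full, cached)))

print(f"largest difference between the two paths: {max_diff}")
```

This covers one layer only. In a multi-layer model the same argument applies layer by layer, because causal masking keeps each position's hidden state independent of later tokens.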

Why Recomputing Attention is Inefficient
Let's make this concrete with numbers.
A standard LLaMA-2 7B model has:

32 transformer layers
32 attention heads
Hidden dimension of 4096
KV head dimension of 128

For a single token, the K and V projections at one layer are each a vector of size 128 per head. Across 32 heads and 32 layers, storing the KV state for one token costs:

2 (K and V) × 32 (layers) × 32 (heads) × 128 (head_dim) × 2 bytes (fp16)
= 524,288 bytes ≈ 0.5 MB per token

Now imagine a prompt of 2,048 tokens:

2,048 tokens × 0.5 MB = 1 GB of KV state

Without caching, every decode step recomputes that entire 1 GB of KV state from scratch. Over 200 decode steps, that is roughly 200 GB worth of K/V tensors recomputed just to generate a few hundred words.
The recompute path also saturates memory bandwidth. On an A100 (2 TB/s bandwidth), even just reading model weights once per step for a 13B model takes:

13B params × 2 bytes/param = 26 GB per forward pass
26 GB / 2000 GB/s = ~13ms per token → ~77 tokens/sec ceiling

Any redundant recomputation cuts directly into this budget.
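To put a rough number on that redundancy, the sketch below counts how many per-layer K/V projections are computed with and without a cache for the 2,048-token prompt and 200 decode steps used above. The function names are illustrative, and the count ignores everything except K/V projections.

```python
def kv_projections_without_cache(prompt_len: int, decode_steps: int) -> int:
    """Every step reprocesses the whole sequence seen so far."""
    return sum(prompt_len + step for step in range(decode_steps))

def kv_projections_with_cache(prompt_len: int, decode_steps: int) -> int:
    """Prefill handles the prompt once; each later step adds one token."""
    return prompt_len + (decode_steps - 1)

without = kv_projections_without_cache(2048, 200)
with_cache = kv_projections_with_cache(2048, 200)
print(f"without cache: {without:,} token projections per layer")
print(f"with cache:    {with_cache:,} token projections per layer")
print(f"ratio:         {without / with_cache:.0f}x")
```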

What KV Cache Stores
The KV cache stores the Key and Value tensors for all previously processed tokens, so they don't need to be recomputed on subsequent decode steps.
Here's the conceptual layout:

At each decode step, the model:

Computes Q only for the new token
Computes K and V for the new token
Appends new K/V to the cache
Runs attention using new Q against all cached K/V
Returns the output logit for the new token

Attention with KV cache:

import math
import torch

def attention_with_cache(query_new, key_cache, value_cache, key_new, value_new):
    # Shapes: query_new               [batch, heads, head_dim]
    #         key_cache / value_cache [batch, seq, heads, head_dim]
    #         key_new / value_new     [batch, 1,   heads, head_dim]
    # Append new K/V to cache
    keys   = torch.cat([key_cache,   key_new],   dim=1)
    values = torch.cat([value_cache, value_new], dim=1)

    # New query attends to ALL keys (cached + new)
    d_k     = query_new.shape[-1]
    scores  = torch.einsum("bhd,bshd->bhs", query_new, keys) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    output  = torch.einsum("bhs,bshd->bhd", weights, values)

    return output, keys, values  # return updated cache

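Result: the new work per step is the Q/K/V projection for a single token, which is O(1) in sequence length, instead of re-projecting all n tokens. Attention itself still reads all n cached keys and values, so the per-step cost shifts from recomputation to memory reads.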

Prefill Phase vs. Decode Phase
Production LLM inference has two fundamentally different operating modes.
Prefill Phase

When you first submit a prompt, the model processes all prompt tokens in parallel:

Prompt: [T_0, T_1, T_2, ..., T_2047]
          ↓    ↓    ↓         ↓
    [All processed simultaneously via batch matrix ops]
          ↓
    [Full KV cache populated for all 2048 positions]
          ↓
    [First output token generated]

Prefill is compute-bound — large matrix multiplications across all positions at once. GPUs excel here.

Prefill time = your Time to First Token (TTFT). Users feel this as the delay before the first character appears.
Decode Phase

After prefill, each new token is generated one at a time:

Step 1: New token T_2048
        → Compute Q, K, V for T_2048 only
        → Append K_2048, V_2048 to cache
        → Attend over positions 0..2048
        → Sample T_2049

Step 2: New token T_2049
        → Cache now has 2050 entries
        → Attend over positions 0..2049
        → Sample T_2050

Decode is memory-bandwidth-bound — constantly reading the growing KV cache from GPU HBM. Compute cores sit largely idle waiting for memory reads.

Decode step memory reads (LLaMA 7B, 2048 cache):
  Model weights:  ~14 GB
  KV cache:       ~1 GB (and growing)
  Total:          ~15 GB

At 2 TB/s bandwidth: ~7.5ms/token → ~133 tokens/sec theoretical max
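The same back-of-the-envelope estimate as a small function. This is a roofline-style sketch under simplifying assumptions: decode is purely bandwidth-bound, weights are read once per step and shared by the whole batch, and each sequence reads its own KV cache. The function name and the `batch_size` parameter are illustrative additions.

```python
def decode_ceiling_tokens_per_sec(weight_bytes, kv_bytes_per_seq,
                                  bandwidth_bytes_per_sec, batch_size=1):
    """Upper bound on decode throughput if memory reads were the only cost."""
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    seconds_per_step = bytes_per_step / bandwidth_bytes_per_sec
    return batch_size / seconds_per_step  # one token per sequence per step

single = decode_ceiling_tokens_per_sec(14e9, 1e9, 2e12)
batched = decode_ceiling_tokens_per_sec(14e9, 1e9, 2e12, batch_size=8)
print(f"batch 1: ~{single:.0f} tokens/sec")
print(f"batch 8: ~{batched:.0f} tokens/sec")
```

The batched figure is higher because the 14 GB weight read is amortized over eight sequences, which is why serving engines work so hard to keep batches full.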

GPU Memory and KV Cache Growth
Here's where things get expensive fast.

KV cache size =
  batch_size × seq_len × num_layers × num_heads × head_dim × 2 × dtype_bytes

Real example — LLaMA-2 13B, batch size 8, 4K context:

batch_size   = 8
seq_len      = 4096
num_layers   = 40
num_kv_heads = 40
head_dim     = 128
dtype_bytes  = 2      # fp16

kv_cache_bytes = (
    batch_size * seq_len * num_layers
    * num_kv_heads * head_dim * 2 * dtype_bytes
)
print(f"KV Cache: {kv_cache_bytes / 1e9:.2f} GB")
# KV Cache: 26.84 GB

Now scale to batch 16 at 8K context:

batch_size = 16
seq_len    = 8192
# KV Cache: 107.37 GB  ← won't fit on a single A100

A100-80GB Memory Budget (LLaMA-2 13B, fp16)

┌─────────────────────────────────────────────┐
│  Model Weights          ~26 GB              │
├─────────────────────────────────────────────┤
│  KV Cache               ~30–40 GB           │  ← the battleground
├─────────────────────────────────────────────┤
│  Activations / Other    ~5–10 GB            │
├─────────────────────────────────────────────┤
│  CUDA Runtime / Misc    ~2–3 GB             │
└─────────────────────────────────────────────┘

The KV cache is the only dynamic component. It grows as sequences get longer, shrinks as sequences end, and fragments if not managed carefully. Every other component is fixed at load time.
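A rough way to turn that budget into a concurrency limit. The sketch assumes the upper end of the "Activations / Other" and "CUDA Runtime" ranges from the table (13 GB combined) and full 4K-token sequences; the function name is hypothetical.

```python
def max_concurrent_sequences(gpu_gb, weights_gb, other_gb, seq_len,
                             num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """How many full-length sequences fit in the memory left for KV cache."""
    per_seq_bytes = seq_len * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes
    kv_budget_bytes = (gpu_gb - weights_gb - other_gb) * 1e9
    return int(kv_budget_bytes // per_seq_bytes)

# LLaMA-2 13B on an A100-80GB, 4K context, fp16
n = max_concurrent_sequences(
    gpu_gb=80, weights_gb=26, other_gb=13,
    seq_len=4096, num_layers=40, num_kv_heads=40, head_dim=128,
)
print(f"full-length sequences that fit: {n}")
```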

Continuous Batching and PagedAttention
The Problem with Static Batching
Traditional inference servers used static batching: wait for N requests, then run until all of them finish. But requests finish at different times. If 7 of 8 requests finish at step 50 while one runs to step 500, then 7 of 8 batch slots (87.5%) sit idle for the remaining 450 steps.
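A quick sketch of how much of a static batch does useful work, measured in slot-steps. The function name is illustrative, and the model is deliberately crude: one slot-step equals one token of useful output.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that generate a token when the whole
    batch runs until its longest request finishes."""
    steps = max(output_lengths)
    useful = sum(output_lengths)
    return useful / (steps * len(output_lengths))

util = static_batch_utilization([50] * 7 + [500])
print(f"useful slot-steps: {util:.2%}")
```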
Continuous Batching

Modern engines batch at the iteration level, not the request level:

Time →  T0  T1  T2  T3  T4  T5  T6  T7
Slot 0: [A   A   A   A   ←done→  E   E   E ]
Slot 1: [B   B   ←done→  D   D   D   D     ]
Slot 2: [C   C   C   C   C   C   ←done→  F ]

As soon as request A finishes, slot 0 immediately starts request E. GPU utilization stays high.
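The difference can be sketched with a toy simulation that only tracks how many decode steps each slot is busy. It assumes requests are admitted in arrival order and ignores prefill cost and memory limits; both function names are hypothetical.

```python
import heapq

def static_makespan(lengths, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_makespan(lengths, slots):
    """A slot picks up the next waiting request the moment it frees up."""
    free_at = [0] * slots          # min-heap of times at which slots become free
    heapq.heapify(free_at)
    end = 0
    for n in lengths:
        start = heapq.heappop(free_at)
        finish = start + n
        heapq.heappush(free_at, finish)
        end = max(end, finish)
    return end

lengths = [500, 50, 50, 50, 500, 50, 50, 50]
print("static:    ", static_makespan(lengths, slots=4), "steps")
print("continuous:", continuous_makespan(lengths, slots=4), "steps")
```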

But this creates a new problem: KV cache memory fragmentation. Different requests have different lengths. You can't pre-allocate contiguous memory blocks without wasting huge amounts.

PagedAttention (vLLM)

vLLM's PagedAttention (2023) solved this with a virtual memory analogy borrowed from OS paging.

Instead of contiguous KV blocks per request, memory is split into fixed-size pages (typically 16 tokens each). A block table maps logical pages to physical GPU memory:

Logical view (Sequence A):
  [Page 0 → Page 1 → Page 2 → Page 3]

Physical GPU memory:
  Block 7:  tokens  0–15  of Sequence A
  Block 2:  tokens 16–31  of Sequence A
  Block 15: tokens 32–47  of Sequence A
  Block 3:  tokens 48–63  of Sequence A

Block table:
  Seq A: { logical 0 → physical 7,
           logical 1 → physical 2,
           logical 2 → physical 15,
           logical 3 → physical 3 }

Benefits:

Near-zero fragmentation: blocks are allocated one at a time as a sequence grows, so the only waste is the unfilled tail of each sequence's last block
Prefix sharing: identical system-prompt KV pages are shared across thousands of requests (copy-on-write)
Swapping: pages can be evicted to CPU memory under pressure

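The vLLM authors report that earlier serving systems waste 60–80% of KV cache memory to fragmentation and over-reservation, while PagedAttention keeps waste under 4%.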
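A toy comparison of reserved-but-unused memory under the two schemes. The four sequence lengths and the 2,048-token reservation are made-up inputs, so the percentages illustrate the mechanism rather than reproduce any published measurement.

```python
import math

def contiguous_waste(actual_lengths, max_len):
    """Every sequence reserves max_len token slots up front."""
    reserved = max_len * len(actual_lengths)
    return 1 - sum(actual_lengths) / reserved

def paged_waste(actual_lengths, block_size=16):
    """Every sequence reserves whole blocks, so only the last block has slack."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in actual_lengths)
    return 1 - sum(actual_lengths) / reserved

lengths = [100, 37, 512, 1500]   # hypothetical sequence lengths
print(f"contiguous pre-allocation: {contiguous_waste(lengths, 2048):.1%} wasted")
print(f"paged, 16-token blocks:    {paged_waste(lengths):.1%} wasted")
```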

# Simplified block table concept
class BlockTable:
    def __init__(self, block_size=16, num_blocks=1000):
        self.block_size  = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_data  = {}   # physical_block_id → KV tensor
        self.tables      = {}   # seq_id → {logical → physical}

    def allocate_block(self):
        return self.free_blocks.pop()

    def get_kv(self, seq_id, logical_block_idx):
        physical = self.tables[seq_id][logical_block_idx]
        return self.block_data[physical]
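The address translation behind a block table fits in a few lines. This sketch reuses the Sequence A mapping from the example above; `locate_token` is a hypothetical helper, not a vLLM function.

```python
BLOCK_SIZE = 16

# Sequence A's block table from the example above: logical -> physical
block_table_seq_a = {0: 7, 1: 2, 2: 15, 3: 3}

def locate_token(block_table, token_pos, block_size=BLOCK_SIZE):
    """Translate a token position into (physical block id, offset in block)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset

print(locate_token(block_table_seq_a, 0))    # first token of the sequence
print(locate_token(block_table_seq_a, 37))   # a token inside logical block 2
```

Token 37 lives in logical block 2 at offset 5, which the table maps to physical block 15. This is the same division-and-remainder step an OS uses to split a virtual address into page number and offset.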

How vLLM and Inference Engines Optimize KV Cache

vLLM in Practice

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,        # split across 2 GPUs
    gpu_memory_utilization=0.90,   # leave 10% headroom
    max_model_len=4096
)

outputs = llm.generate(
    ["Explain KV cache to an ML engineer"],
    SamplingParams(temperature=0.7, max_tokens=512)
)

vLLM ships: PagedAttention, continuous batching, Flash Attention v2, tensor parallelism, and prefix caching out of the box.

Flash Attention
Flash Attention avoids the expensive O(n²) memory allocation by tiling computations to fit in SRAM:

Standard attention:
  Q, K, V → HBM → compute N×N matrix → HBM → output
  Memory: O(n²)

Flash Attention:
  Q, K, V → SRAM tiles → fused kernel → output
  Never materializes full N×N matrix in HBM
  Memory: O(n)   Speed: 2–4× faster
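The core trick that makes tiling possible is an online softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them as each tile arrives. Below is a plain-Python sketch for a single query. The tile size and input vectors are arbitrary, and real Flash Attention kernels do this in fused GPU code over blocks of queries as well.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Computes every score first, then normalizes."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Processes keys/values tile by tile, never holding all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        tile_values = values[start:start + tile]
        scores = [dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first tile
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(probs)
        acc = [acc[j] * rescale + sum(p * v[j] for p, v in zip(probs, tile_values))
               for j in range(dim)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0, 2.0, 0.1]
keys = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, -0.5, 1.0], [2.0, -1.0, 0.0, 0.4],
        [-0.7, 0.6, 1.5, -0.2], [0.1, 0.1, 0.1, 0.1]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]

diff = max(abs(a - b) for a, b in
           zip(naive_attention(q, keys, values), tiled_attention(q, keys, values)))
print(f"max difference between naive and tiled: {diff:.2e}")
```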

Prefix Caching

If thousands of users share the same system prompt, their KV caches for that prefix are identical. Compute it once, share it everywhere:

Without prefix caching:
  Request 1: compute KV for [SYSTEM: 2048 tokens] + [query A]
  Request 2: compute KV for [SYSTEM: 2048 tokens] + [query B]
  Request 3: compute KV for [SYSTEM: 2048 tokens] + [query C]

With prefix caching:
  Request 1: compute + store KV for [SYSTEM]
  Requests 2, 3, ...: load cached KV, compute only [query N]

TTFT reduction: 50–90% on system-prompt-heavy workloads.
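A small sketch of where the savings come from, counting prefill tokens only. The three query lengths are made-up inputs, and the result is a token count rather than a latency measurement.

```python
def prefill_tokens(system_len, query_lens, prefix_cached):
    """Total tokens that must go through prefill across all requests."""
    if prefix_cached:
        # The shared system prompt is computed once, then reused
        return system_len + sum(query_lens)
    return sum(system_len + q for q in query_lens)

queries = [20, 35, 50]  # hypothetical user-query lengths in tokens
without = prefill_tokens(2048, queries, prefix_cached=False)
cached = prefill_tokens(2048, queries, prefix_cached=True)
print(f"without prefix caching: {without} tokens")
print(f"with prefix caching:    {cached} tokens ({1 - cached / without:.0%} fewer)")
```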

KV Cache Challenges in Long-Context Models

Modern models like GPT-4o, Claude 3.5, and Gemini 1.5 Pro support 128K–1M token contexts. This creates KV cache problems at a completely different scale.

Memory at 128K Context

# LLaMA-3 70B with GQA (8 KV heads), 128K context, batch size 1
seq_len      = 131072   # 128K tokens
num_layers   = 80
num_kv_heads = 8        # Grouped Query Attention
head_dim     = 128
dtype_bytes  = 2

kv_cache_bytes = seq_len * num_layers * num_kv_heads * head_dim * 2 * dtype_bytes
print(f"KV Cache: {kv_cache_bytes / 1e9:.2f} GB")
# KV Cache: 42.95 GB  ← for ONE sequence

4 concurrent requests need about 172 GB of KV cache, more than two 80 GB H100s can hold, before counting model weights.

Decode Bandwidth at 128K

KV cache read per decode step:  43 GB
H100 HBM bandwidth:             3.35 TB/s
Time reading KV cache:          43 / 3350 ≈ 12.8ms per token
→ Maximum ~78 tokens/sec, memory-bandwidth limited

That ceiling gets worse with every new token added.
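Grouped Query Attention is what keeps these numbers from being far worse. The sketch below compares per-token KV size with 8 KV heads against a hypothetical multi-head variant that keeps one KV head per query head (LLaMA-3 70B has 64 query heads). The helper name is illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
print(f"GQA (8 KV heads):  {gqa / 1024:.0f} KiB per token")
print(f"MHA (64 KV heads): {mha / 1024:.0f} KiB per token")
print(f"reduction:         {mha // gqa}x")
```

Both the memory footprint and the bytes read per decode step shrink by the same factor.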

What Long-Context Models Use

Real-World Production Problems

OOM Crashes Under Load

Without PagedAttention, a single unexpectedly long sequence in a batch can OOM the entire batch. Modern engines handle this with eviction policies and graceful degradation — but this requires careful tuning of gpu_memory_utilization and per-request max_tokens limits.

Prefill-Decode Disaggregation
At scale, teams separate prefill and decode onto different hardware:

Prefill servers: compute-heavy, large batches, optimized for throughput and TTFT
Decode servers: bandwidth-heavy, latency-sensitive, optimized for time between tokens (TBT)

The KV cache is computed on a prefill server then transferred to a decode server over InfiniBand or NVLink. This is the architecture behind services like Anyscale Endpoints and AWS Inferentia2 deployments.

Batching Heterogeneous Request Lengths

Batching a 100-token request with a 4096-token request pads the short one to max length — wasting KV cache memory and compute. Solutions:

Bucketing — group requests by length ± Δ
Dynamic padding — pad to nearest power of 2
Chunked prefill — break long prefills into fixed-length chunks
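As a rough illustration of the bucketing idea, the sketch below groups requests by the next power of two above their length and compares padded token counts. The four lengths and both helper names are assumptions for the example.

```python
from collections import defaultdict

def padded_tokens(lengths):
    """Token slots used when every request is padded to the longest one."""
    return max(lengths) * len(lengths)

def bucket_by_power_of_two(lengths):
    """Group lengths by the smallest power of two that is >= the length."""
    buckets = defaultdict(list)
    for n in lengths:
        buckets[1 << (n - 1).bit_length()].append(n)
    return dict(buckets)

lengths = [100, 120, 4096, 4000]
single_batch = padded_tokens(lengths)
bucketed = sum(padded_tokens(group) for group in bucket_by_power_of_two(lengths).values())
print(f"real tokens:        {sum(lengths)}")
print(f"one padded batch:   {single_batch}")
print(f"bucketed batches:   {bucketed}")
```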

Cache Eviction Under Memory Pressure

When GPU memory is exhausted, KV pages must be evicted. Common policies:

LRU: evict sequences not accessed recently
Priority-based: keep high-SLA requests in GPU, swap background jobs to CPU
Recompute on miss: evict aggressively, recompute KV from prompt if needed
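A minimal sketch of the LRU policy at sequence granularity. The class and method names are hypothetical, and real engines track individual blocks and also decide whether an evicted sequence is swapped to CPU or recomputed later.

```python
from collections import OrderedDict

class LRUSequenceCache:
    """Tracks how many KV blocks each sequence holds, oldest first."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.used_blocks = 0
        self._seqs = OrderedDict()   # seq_id -> number of blocks held

    def touch(self, seq_id):
        """Mark a sequence as recently used (e.g. it just decoded a token)."""
        self._seqs.move_to_end(seq_id)

    def admit(self, seq_id, num_blocks):
        """Make room for a new sequence; returns the ids that were evicted."""
        if num_blocks > self.capacity_blocks:
            raise MemoryError("request is larger than the whole cache")
        evicted = []
        while self.used_blocks + num_blocks > self.capacity_blocks:
            victim, blocks = self._seqs.popitem(last=False)  # least recently used
            self.used_blocks -= blocks
            evicted.append(victim)
        self._seqs[seq_id] = num_blocks
        self.used_blocks += num_blocks
        return evicted

cache = LRUSequenceCache(capacity_blocks=10)
cache.admit("a", 4)
cache.admit("b", 4)
cache.touch("a")                 # "b" is now the least recently used
evicted = cache.admit("c", 4)
print("evicted:", evicted)
```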

Future Optimizations

KV Cache Quantization

Standard KV:    FP16 → 2 bytes/element
INT8 KV:        INT8 → 1 byte/element   (2× memory reduction)
INT4 KV:        INT4 → 0.5 bytes/element (4× memory reduction)

Research (KVQuant, KIVI) shows INT8 quantization has minimal perplexity impact. INT4 is viable for values (keys are more sensitive to quantization).

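# INT8 KV cache with one symmetric scale per vector along the last dimension
def quantize_kv(tensor, dtype=torch.int8):
    # Compute the scale in fp32; the clamp avoids division by zero for all-zero vectors
    scale      = tensor.float().abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
    # Round to nearest instead of truncating, then keep values in the int8 range
    quantized  = torch.round(tensor.float() / scale).clamp(-127, 127).to(dtype)
    return quantized, scale

def dequantize_kv(quantized, scale):
    return quantized.float() * scale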

Sliding Window Attention

Standard:       token_i attends to tokens [0, i]      → O(seq_len) KV cache
Sliding window: token_i attends to tokens [i-W, i]    → O(W) KV cache

Mistral 7B uses W=4096 with a rolling buffer — KV cache memory stays constant regardless of sequence length.
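A rolling buffer can be sketched as a fixed-size list indexed by position modulo the window. The class name and the tiny window are assumptions for the example, and the "KV" entries are placeholder strings rather than tensors.

```python
class RollingKVBuffer:
    """Keeps KV entries for the last `window` positions only."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window   # fixed memory, reused in a circle
        self.next_pos = 0

    def append(self, kv):
        # Position i overwrites whatever position i - window left behind
        self.slots[self.next_pos % self.window] = (self.next_pos, kv)
        self.next_pos += 1

    def visible_positions(self):
        return sorted(entry[0] for entry in self.slots if entry is not None)

buf = RollingKVBuffer(window=4)
for pos in range(10):
    buf.append(f"kv{pos}")
print(buf.visible_positions())
```

After ten appends with a window of four, only positions 6 through 9 remain, and memory use never grew past four slots.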

Speculative Decoding

Without speculation:
  Step 1 → token 1
  Step 2 → token 2
  ...5 steps → 5 tokens

With speculation:
  Draft model proposes 5 candidate tokens (5 cheap passes of a much smaller model)
  Target model verifies all 5 in 1 parallel pass
  Accepted tokens cost 1 target-model pass in total instead of 1 pass each
  → 2–3× throughput improvement when most drafts are accepted

Implementations: vLLM speculative decoding, Medusa, EAGLE.
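The acceptance rule for greedy decoding can be sketched as follows. This is a simplified illustration, not the sampling-based acceptance used with temperature > 0: `target_next_token` is a hypothetical stand-in for the target model, and it is called once per position here for clarity, whereas a real engine scores all draft positions in a single parallel pass.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Accept draft tokens while they match the target's greedy choice.

    Returns the tokens to append: the matching draft prefix, plus one
    token from the target (a correction at the first mismatch, or a
    bonus token if every draft token matched).
    """
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)    # target's choice replaces the miss
            return accepted
        accepted.append(token)
    accepted.append(target_next_token(prefix + accepted))  # bonus token
    return accepted

def toy_target(tokens):
    # Toy "model" for the example: always predicts last token + 1 (mod 10)
    return (tokens[-1] + 1) % 10

partial = verify_draft([1, 2, 3], [4, 5, 9, 7, 8], toy_target)
full = verify_draft([1, 2, 3], [4, 5], toy_target)
print(partial)   # two drafts accepted, third corrected
print(full)      # both drafts accepted, plus a bonus token
```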

Multi-head Latent Attention (MLA)
DeepSeek-V2's MLA caches a low-rank compressed representation instead of full K/V tensors, then decompresses on-the-fly during decode:

Standard KV:  Store K [d_k] + V [d_v] per token per layer
MLA:          Store c [d_c] (compressed), decompress when needed
              d_c << d_k + d_v → up to 5× smaller KV footprint

Final Summary

┌─────────────────────────────────────────────────────────────┐
│                    LLM INFERENCE PIPELINE                    │
│                                                             │
│  INPUT PROMPT                                               │
│       ↓                                                     │
│  ┌─────────────┐                                            │
│  │   PREFILL   │  All prompt tokens processed in parallel   │
│  │   PHASE     │  Compute-bound → sets your TTFT            │
│  └──────┬──────┘  Full KV cache populated                   │
│         ↓                                                   │
│  ┌─────────────┐                                            │
│  │   DECODE    │  One token per step, autoregressive        │
│  │   PHASE     │  Memory-bandwidth-bound                    │
│  │  [NEW TOK]  │  Q for new token only                      │
│  │             │  K/V appended to cache                     │
│  │             │  Attend over all cached K/V                │
│  └──────┬──────┘  → Sample next token                       │
│         └──→ Repeat until EOS or max_tokens                 │
└─────────────────────────────────────────────────────────────┘

What KV cache does:
Stores computed Key and Value tensors for all processed tokens, eliminating recomputation at each decode step. Per-step K/V projection work drops from O(n) to O(1); attention still reads the full cache.

Why it's a bottleneck:
Grows linearly with sequence length. At 128K+ contexts, consumes tens of GBs per request. Decode becomes memory-bandwidth-limited.

How modern engines solve it:

Where it's going:

LLM inference is ultimately a resource allocation problem. The GPU has a fixed memory budget, a fixed bandwidth, and a fixed compute budget. KV cache sits at the intersection of all three.

The engineers pushing the frontier on inference performance aren't just running models faster — they're rethinking what needs to be stored, what can be shared, what can be compressed, and what can be recomputed.

If you're building production LLM systems, KV cache isn't a detail. It's the design constraint everything else bends around.

Modular LLM Inference Engine from Scratch

Kotcherla Murali Krishna — Tue, 19 May 2026 17:43:54 +0000

Why vLLM, TensorRT-LLM, and llama.cpp each solve only part of the problem — and how I built inferx to fill the gap. Runs on any laptop, no GPU needed.

I spent the last few months building inferx — an open-source LLM inference optimization library that runs on any machine, including a laptop with no GPU. Along the way I learned more about how LLMs actually work at the systems level than in any course or paper I had read before.
This is that story: what the problem is, how I solved it, what the code looks like, and how you can run it in 60 seconds.

mkkotcherla / inferx

Description: Modular LLM inference library (KV cache, quantization, batching, speculative decoding)

inferx ⚡

Unified LLM inference optimization library — modular, composable, production-grade.

inferx packages the hard parts of LLM serving into one clean library:

Continuous batching scheduler — iteration-level, FCFS / priority / deadline
Paged KV cache — PagedAttention with prefix caching and sliding window
Quantization — AWQ, GPTQ, INT8, FP8, per-layer mixed precision
Tensor & pipeline parallelism — multi-GPU sharding via NCCL
Speculative decoding — draft model + Medusa heads
OpenAI-compatible server — /v1/chat/completions drop-in endpoint
Prometheus + OpenTelemetry — TTFT, TBT, cost-per-request tracking
CPU-only mode — runs on any laptop, zero GPU for development

Why inferx?

Feature	vLLM	TRT-LLM	llama.cpp	inferx
Paged KV cache	✅	✅	❌	✅
Prefix caching	✅	❌	❌	✅
All quant formats	⚠️	⚠️	GGUF	✅
Speculative decoding	✅	✅	❌	✅
CPU / Metal backend	❌	❌	✅	✅
Modular API	❌	❌	❌	✅
Built-in cost tracking	❌	❌	❌	✅
Open

…


The problem nobody talks about
When most people want to serve an LLM, they do something like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

for prompt in user_prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(output[0]))

This works beautifully for one user. But the moment you have ten users sending requests at the same time, it quietly falls apart:

Each request waits for the previous one to finish
GPU memory is reserved at worst-case sequence length even if the user sends 20 tokens
60–90% of VRAM sits unused at any given moment
No way to prioritize urgent requests over slow background ones

The result: a GPU costing $3/hr sitting at 15% utilization while users wait 8 seconds for a response.
This is the problem that vLLM, TensorRT-LLM, and llama.cpp exist to solve. But each solves only part of it, and none exposes a clean modular API where you can swap individual components.
That gap is where inferx comes from.

What existing tools miss:

The key gap is modularity. vLLM is a monolith: you cannot easily pull out just its KV cache manager and use it elsewhere. TensorRT-LLM is NVIDIA-only and built on the closed-source TensorRT runtime. llama.cpp has no batching scheduler.
inferx is built so every component is independently importable:

from inferx.memory import PagedKVCacheManager
from inferx.scheduler import ContinuousBatchScheduler
from inferx.quantization import AWQQuantizer
from inferx.serving import OpenAIServer

Use only what you need. Swap any component for your own implementation.

The three ideas that make this work

Paged KV cache: treating GPU memory like an OS

Every transformer layer produces Key and Value tensors for every processed token. The naive approach reserves a contiguous GPU memory block per request at maximum sequence length, wasting up to 92% of the allocation. The solution, from the PagedAttention paper (SOSP 2023), borrows from OS virtual memory paging. KV memory is divided into fixed-size 16-token blocks. A sequence's KV data lives in non-contiguous blocks tracked by a block table, exactly like an OS maps virtual addresses to physical frames. Here is how inferx implements this:

class PagedKVCacheManager:
    def allocate(self, sequence) -> int:
        """
        Allocate KV blocks for a new sequence.
        Returns number of tokens whose KV is already cached.
        """
        # Check prefix cache first
        cached_len, shared_blocks = self.prefix_cache.lookup(
            sequence.prompt_token_ids
        )
        # Share those blocks (ref-counted)
        for bid in shared_blocks:
            self.allocator.share(bid)

        # Allocate only what's needed for the uncached portion
        remaining = len(sequence.prompt_token_ids) - cached_len
        new_blocks = self.allocator.allocate_gpu(
            math.ceil(remaining / self.block_size)
        )
        sequence.block_table = shared_blocks + new_blocks
        return cached_len  # prefill can skip this many tokens
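The `allocate` method above relies on an allocator that can share blocks between sequences. That allocator is not shown in the article, so the following is only a guess at its shape: a minimal ref-counted free list. The class name and the `release` method are hypothetical and are not taken from the inferx source.

```python
class RefCountedBlockAllocator:
    """Free list of block ids plus a reference count per allocated block."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_counts = {}            # block_id -> number of sequences using it

    def allocate(self, n):
        if n > len(self.free):
            raise MemoryError(f"need {n} blocks, only {len(self.free)} free")
        blocks = [self.free.pop() for _ in range(n)]
        for block_id in blocks:
            self.ref_counts[block_id] = 1
        return blocks

    def share(self, block_id):
        """Another sequence starts reading an already-filled block."""
        self.ref_counts[block_id] += 1

    def release(self, block_id):
        """A sequence is done with a block; free it once nobody uses it."""
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.ref_counts[block_id]
            self.free.append(block_id)

allocator = RefCountedBlockAllocator(num_blocks=4)
first, second = allocator.allocate(2)
allocator.share(first)       # a second sequence reuses the cached prefix block
allocator.release(first)     # the original owner finishes
print("free blocks while still shared:", len(allocator.free))
allocator.release(first)     # the last user finishes
print("free blocks after last release:", len(allocator.free))
```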

For 10 concurrent requests with 512-token max length: naive pre-allocation needs ~5 MB. PagedAttention with average 64-token generation uses 0.3 MB — 94% reduction.

Continuous batching: never let the GPU idle

Traditional serving waits for a full batch to finish before starting the next. If one request generates a 500-token essay while others finished at 40 tokens, the GPU mostly idles. Continuous batching, from the Orca paper (OSDI 2022), operates at the iteration level. Every forward pass, the scheduler re-evaluates which sequences to include. New requests join mid-flight. Finished sequences immediately free their KV blocks.

class ContinuousBatchScheduler:
    def schedule(self) -> Batch:
        # Ensure decode sequences have room for 1 more token
        self._ensure_decode_space()
        # Promote preempted sequences if memory recovered
        self._restore_preempted()
        # Admit new sequences from waiting queue
        self._admit_waiting()
        # Build batch: prefill + decode sequences together
        return Batch(
            prefill_seqs=[s for s in self._running
                          if s.status == PREFILLING],
            decode_seqs=[s for s in self._running
                         if s.status == DECODING],
        )

Three strategies supported: FCFS, priority (preempts lower-priority sequences), and deadline (hard SLA targets).
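One simple way to think about the three strategies is as different sort keys over the waiting queue. The sketch below is an illustration of that idea only; the `Request` fields and `admission_order` helper are hypothetical, and it leaves out preemption, which the priority strategy also involves.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    arrival: float
    priority: int = 0                 # higher means more urgent
    deadline: float = float("inf")    # absolute time the response is due

# Each strategy is just a different ordering of the waiting queue
ORDERINGS = {
    "fcfs": lambda r: r.arrival,
    "priority": lambda r: (-r.priority, r.arrival),
    "deadline": lambda r: (r.deadline, r.arrival),
}

def admission_order(waiting, strategy):
    return [r.req_id for r in sorted(waiting, key=ORDERINGS[strategy])]

waiting = [
    Request("a", arrival=0.0, priority=0, deadline=9.0),
    Request("b", arrival=1.0, priority=5, deadline=3.0),
    Request("c", arrival=2.0, priority=5, deadline=1.0),
]
for name in ORDERINGS:
    print(f"{name:9s}", admission_order(waiting, name))
```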

Prefix caching: skip recomputing the same prompt

If 1,000 users share the same system prompt, naive serving computes that KV cache 1,000 times. Prefix caching computes it once and reuses it across all matching requests. inferx uses a hash-based LRU cache:

class PrefixCache:
    def lookup(self, token_ids: List[int]):
        """Find the longest cached prefix of these token IDs."""
        num_full_blocks = len(token_ids) // self.block_size
        for n in range(num_full_blocks, 0, -1):
            prefix = token_ids[:n * self.block_size]
            h = self._hash_tokens(prefix)
            if h in self._cache:
                self._cache.move_to_end(h)  # LRU update
                return n * self.block_size, self._cache[h]
        return 0, []
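The excerpt above leaves out how entries get into the cache and how tokens are hashed. Here is a self-contained sketch that fills those gaps with assumptions: SHA-256 over the token list for `_hash_tokens`, an `insert` method that registers every block-aligned prefix, and a `max_entries` cap. None of these details are confirmed by the inferx source. The `lookup` logic mirrors the excerpt.

```python
import hashlib
from collections import OrderedDict
from typing import List, Tuple

class SimplePrefixCache:
    def __init__(self, block_size: int = 16, max_entries: int = 1024):
        self.block_size = block_size
        self.max_entries = max_entries
        self._cache: "OrderedDict[str, List[int]]" = OrderedDict()

    @staticmethod
    def _hash_tokens(token_ids: List[int]) -> str:
        return hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        """Register every block-aligned prefix so shorter prefixes can hit too."""
        num_full_blocks = len(token_ids) // self.block_size
        for n in range(1, num_full_blocks + 1):
            h = self._hash_tokens(token_ids[:n * self.block_size])
            self._cache[h] = list(block_ids[:n])
            self._cache.move_to_end(h)
        while len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)   # drop least recently used

    def lookup(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Find the longest cached block-aligned prefix of these token IDs."""
        num_full_blocks = len(token_ids) // self.block_size
        for n in range(num_full_blocks, 0, -1):
            h = self._hash_tokens(token_ids[:n * self.block_size])
            if h in self._cache:
                self._cache.move_to_end(h)
                return n * self.block_size, self._cache[h]
        return 0, []

prefix_cache = SimplePrefixCache(block_size=4)
system_prompt = list(range(8))                       # 8 tokens = 2 full blocks
prefix_cache.insert(system_prompt + [100, 101], block_ids=[10, 11])

hit = prefix_cache.lookup(system_prompt + [200, 201, 202])
miss = prefix_cache.lookup([9, 9, 9, 9])
print("hit: ", hit)
print("miss:", miss)
```

Hashing each prefix from scratch costs quadratic work on long prompts; a common refinement is to chain hashes block by block, so each block's hash depends on the previous one.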

On GPU with a 2,000-token shared system prompt: 3–5× speedup on TTFT for the second and subsequent requests.

Full architecture
inferx has six independently usable layers:

Try it right now — no GPU needed
inferx has a complete CPU-only mode. Run the full pipeline — scheduler, KV cache, batching, streaming, HTTP server — on any laptop.

git clone https://github.com/mkkotcherla/inferx.git
cd inferx
pip install -e .
python examples/quickstart.py

Output in ~3 seconds:
inferx quickstart
────────────────────────────────────────
Model: inferx-mock-cpu
KV blocks: 512

Prompt: 'The key innovation of PagedAttention is'
Output: '...'
Usage: {'prompt_tokens': 40, 'completion_tokens': 49, 'total_tokens': 89}

Done ✓

Ten concurrent requests:

python examples/batch_requests.py

Requests: 10
Total tokens: 250
Wall time: 1.38s
Throughput: 180.7 tok/s
Avg latency: 138ms/request

OpenAI-compatible server:

pip install fastapi "uvicorn[standard]"
python inferx/cli.py serve --mock --port 8000

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="inferx-mock-cpu",
    messages=[{"role": "user", "content": "What is KV caching?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)

Benchmark results — GPT-2 124M on CPU
Real GPT-2 124M architecture (correct OpenAI layers, random weights), CPU only, 1 thread:

The throughput numbers are similar on a 1-thread CPU — expected and honest. Scheduling overhead isn't offset by batching gains without GPU parallelism. On GPU with 50+ concurrent requests, continuous batching gives 8–12× throughput improvement because GPU tensor cores scale with batch size.
The 94% memory savings are real and hardware-independent.

The one thing I would do differently
I would have added run_concurrent() from day one. The natural instinct is asyncio.gather() — but that makes each task drive its own decode loop and they fight over scheduler state.
The correct pattern: a single shared decode loop with all requests in one queue:

async def run_concurrent(engine, prompts_and_params):
    """All requests share one decode loop — correct behavior."""
    seqs = []
    for prompt, params in prompts_and_params:
        seq = Sequence(prompt=prompt, ...)
        engine._scheduler.add_request(seq)
        seqs.append(seq)

    # One loop drives everything
    while not all(s.is_finished for s in seqs):
        await engine._step()

    return [(engine._detokenize(s.output_token_ids), usage(s))
            for s in seqs]

This is how vLLM works internally. Getting it right made throughput jump significantly.

Try it and contribute

git clone https://github.com/mkkotcherla/inferx.git
cd inferx && pip install -e .
python examples/quickstart.py        # works on any laptop
python examples/benchmark.py         # throughput + latency numbers
python inferx/cli.py serve --mock    # OpenAI server, no GPU needed

11 runnable examples, 22 unit tests, Apache 2.0.
If you are working on LLM inference, building on top of this, or have questions — open an issue or Discussion. PRs are very welcome, especially for the GPU kernel layer.

Built with Python, PyTorch, and a lot of reading of the vLLM, Orca, and PagedAttention papers.

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

Table of Contents

Section 1 — What Is a KV Cache and Why Does It Exist?

What to cover:

Code snippet to include:

Key insight to land:

Section 2 — The Problem: Traditional KV Cache and Memory Fragmentation

What to cover:

Three types of fragmentation:

Diagram suggestion:

Section 3 — Inspiration from OS Virtual Memory: The vLLM Insight

What to cover:

The analogy table:

Quote worth referencing:

Section 4 — PagedAttention: How It Works

What to cover:

Section 5 — Memory Fragmentation: Before vs After

What to cover:

Fragmentation comparison diagram:

Section 6 — Throughput Gains: Numbers and Benchmarks

What to cover:

Throughput table (representative, based on vLLM paper figures):

Code snippet — measuring throughput with vLLM:

Section 7 — Trade-offs and Limitations

What to cover (balanced, Medium readers appreciate honesty):

Block size sensitivity:

Section 8 — Who Should Care About This?

Target reader callouts:

Section 9 — Key Takeaways

Memory Systems for AI Agents: The Complete Developer Guide

Introduction

Why Memory Matters

The Four Types of Agent Memory

1. Sensory / Working Memory (In-Context)

2. Episodic Memory (Conversation & Event History)

3. Semantic Memory (Knowledge & Facts)

4. Procedural Memory (Skills & Workflows)

Memory Storage Backends

Memory Architecture Patterns

Pattern 1: The Memory Stack

Pattern 2: Memory-Augmented ReAct

Pattern 3: Hierarchical Summarization

Memory Retrieval Strategies

Recency-Weighted Retrieval

Importance-Based Retention

Contextual Compression

Common Pitfalls and How to Avoid Them

Pitfall 1: Memory Hallucination Propagation

Pitfall 2: Context Flooding

Pitfall 3: No Memory Eviction Policy

Pitfall 4: Memory Without Privacy Controls

Putting It All Together: A Reference Architecture

Key Takeaways

Building Micro Agents as Production-Grade Microservices

Table of Contents

Introduction & Motivation

Why monolithic agent systems fail in production

The microservice solution

What is a Micro Agent?

Core Architecture Principles

Single Responsibility

Stateless Reasoning, Stateful Memory

Schema-First Tool Contracts

Idempotent Actions

Async by Default

Explicit Context Boundaries

Agent Service Design

Project Layout

API Contract

The AgentRunner Loop

Full Implementation

Inter-Agent Communication

Pattern Selection Matrix

gRPC Service Definition

Kafka Event Schema

Kafka Producer (in Orchestrator)

Tool Registry Service

Architecture

Tool Registration Schema