DEV Community: Dhruv Aggarwal

PagedAttention: How GPUs handle memory

Dhruv Aggarwal — Thu, 02 Jul 2026 18:08:57 +0000

Imagine you're running a high-end restaurant. When the first few guests arrive, service is lightning fast. But as the room fills up and guests start ordering complex, multi-course meals, the kitchen starts to lag.

The problem isn't that the chefs aren't skilled—it's that the counter space is cluttered. To keep track of a guest's order, the chef reserves a massive, dedicated tray for every single table, regardless of whether they ordered a 12-course tasting menu or just a glass of water. Most of those trays are empty, but they're taking up all the room in the kitchen, preventing new orders from being started.

In LLM systems, this "clutter" manifests as a spike in Time to First Token (TTFT). For a product with millions of users, that latency is a deal-breaker.

To understand the fix, we have to look at how LLMs actually "think" during inference.

The Two Phases of Response
When you hit enter on a long prompt, the system enters the Prefill Phase. The model processes your entire input to build a mathematical representation of your request. This is compute-bound; it's all about raw processing power.

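Then comes the Decoding Phase. Because LLMs are autoregressive (they generate one token at a time), the model must constantly look back at everything it has already processed to decide what comes next. This is memory-bound: each step has to read the model weights and the cached context from GPU memory to produce just one token.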

The KV Cache Problem
To avoid recalculating the entire conversation for every single new word, we use a KV Cache. It stores the "keys" and "values" of previous tokens in the GPU's VRAM.

Traditionally, the system reserves a contiguous block of VRAM based on the model's maximum context length. If the limit is 2048 tokens, it reserves space for 2048, even if you only typed "Hello."

This leads to Internal Fragmentation (wasted space inside the reserved block) and External Fragmentation (no single contiguous block large enough for a new request), severely limiting how many users a GPU can serve simultaneously.
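To put a rough number on the internal fragmentation, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) is a hypothetical chosen for illustration, and "Hello" is assumed to be a single token.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor of 2: one key vector and one value vector per token, per layer, per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
MAX_CONTEXT = 2048

reserved = kv_cache_bytes(MAX_CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
used = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)  # assumes "Hello" is one token
wasted_fraction = 1 - used / reserved

print(f"reserved: {reserved / 2**20:.0f} MiB")  # reserved: 1024 MiB
print(f"used:     {used / 2**10:.0f} KiB")      # used:     512 KiB
print(f"wasted:   {wasted_fraction:.2%}")       # wasted:   99.95%
```

Under these assumptions a single short request pins a full gibibyte of VRAM while using about half a mebibyte of it.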

Enter PagedAttention
Think of PagedAttention like a library archive. Instead of giving every researcher a massive, empty 10-volume set of binders and telling them to fill it linearly, the librarian gives them a "Table of Contents." As the researcher finds more information, the librarian gives them single, standardized index cards (pages) and stores them wherever there is a free slot in the cabinet. The researcher doesn't care where the cards are; they just use the Table of Contents to find them.
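The "Table of Contents" in this analogy is usually called a block table. Below is a minimal sketch of the idea, not the implementation of any particular serving engine: `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are illustrative names and values.

```python
BLOCK_SIZE = 4  # tokens per block; an illustrative value

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """One request; its block table maps logical block i to a physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index):
        # Translate a logical token position into (physical block, offset).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def free(self):
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0

allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(5):
    a.append_token()
b.append_token()          # another request grabs a block in between
for _ in range(4):
    a.append_token()

print(a.block_table)       # [7, 6, 4]  (not contiguous: block 5 belongs to b)
print(a.physical_slot(8))  # (4, 0)
```

The waste per request is now bounded by one partially filled block instead of the whole unused context window.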

Why "Paged Attention" and not "Paged Allocation"?
While the mechanism is indeed about memory allocation, the reason it's called Paged Attention is that it fundamentally changes how the Attention mechanism accesses data.

In a standard Transformer implementation, the attention kernel expects the KV cache to be a contiguous tensor (a long, unbroken strip of memory). PagedAttention modifies the attention kernel itself so that it can "hop" across non-contiguous memory blocks. It isn't just moving data around; it's changing how the model attends to its memory.
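To see why the result does not change, here is a toy single-head sketch in plain Python. It gathers keys and values through a block table and compares the output with ordinary attention over a contiguous list. A real kernel would read the blocks in place on the GPU rather than copy them into a list; the function names and tiny vectors here are made up for illustration.

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention over lists of vectors."""
    scale = 1 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def paged_attend(query, key_pool, value_pool, block_table, num_tokens, block_size):
    """Same computation, but K/V are looked up through a block table."""
    keys, values = [], []
    for t in range(num_tokens):
        block, offset = block_table[t // block_size], t % block_size
        keys.append(key_pool[block][offset])
        values.append(value_pool[block][offset])
    return attend(query, keys, values)

# Three cached tokens, stored contiguously...
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# ...and the same tokens scattered into physical blocks 3 and 1 of a pool.
BLOCK = 2
empty = [0.0, 0.0]
key_pool = [[empty, empty] for _ in range(4)]
value_pool = [[empty, empty] for _ in range(4)]
block_table = [3, 1]
for t, (k, v) in enumerate(zip(keys, values)):
    key_pool[block_table[t // BLOCK]][t % BLOCK] = k
    value_pool[block_table[t // BLOCK]][t % BLOCK] = v

query = [0.5, -0.25]
contiguous = attend(query, keys, values)
paged = paged_attend(query, key_pool, value_pool, block_table, len(keys), BLOCK)
print(paged == contiguous)  # True
```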

In high-scale systems, the bottleneck is rarely the algorithm itself, but how the algorithm manages the underlying hardware resources. By decoupling logical memory from physical storage, we transform a "demo" into a production-grade service.

Thinking Faster vs. Thinking Longer: Test-Time Compute

Dhruv Aggarwal — Thu, 25 Jun 2026 14:36:15 +0000

Imagine you are asked to solve a complex math problem. If you answer immediately, you’ll likely rely on intuition or a guess. But if you are given ten minutes to scratch out ideas, double-check your logic, and correct your mistakes before speaking, your accuracy improves significantly—even though your "knowledge" hasn't changed.

In technical terms, this is the shift toward "thinking" models, or Test-Time Compute.

In traditional LLM inference, the model spends roughly the same amount of compute per generated token whether it is answering "What is 2+2?" or "How do I architect a distributed database?" Each token comes out of a single forward pass, with no extra budget reserved for harder questions.

Test-time compute allows the model to spend more computational resources after the prompt is received but before the final answer is delivered.

Instead of a straight line from Input to Output, the model employs strategies like:

Chain-of-Thought (CoT): Forcing the model to generate intermediate reasoning steps.
Search and Verification: Generating multiple candidate paths and using a "verifier" to pick the best one (e.g., best-of-N sampling or Monte Carlo Tree Search).
Self-Correction: Iteratively refining an answer based on its own internal critique.
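The search-and-verify strategy from the list above can be sketched in a few lines. This is a toy best-of-N loop, not a real model: `generate_candidate` is a stand-in whose early attempts are deliberately wrong, and `verifier_score` is a stand-in checker.

```python
def generate_candidate(a, b, attempt):
    """Stand-in for a sampled model answer to a + b; early attempts are deliberately wrong."""
    noise = [1, -1, 0, 2]
    return a + b + noise[attempt % len(noise)]

def verifier_score(a, b, answer):
    """Stand-in verifier: checks the answer by inverting the operation."""
    return 1.0 if answer - b == a else 0.0

def best_of_n(a, b, n):
    """Spend n generations, then keep the candidate the verifier likes best."""
    candidates = [generate_candidate(a, b, i) for i in range(n)]
    return max(candidates, key=lambda ans: verifier_score(a, b, ans))

print(best_of_n(17, 25, n=1))  # 43 (one sample, wrong)
print(best_of_n(17, 25, n=3))  # 42 (more test-time compute, verified answer)
```

The weights of the "model" never change between the two calls; only the inference budget does.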

Why does this matter for real systems? Because we are hitting a point of diminishing returns in scaling pre-training data and model size. We can't just make the model "bigger" to make it smarter.

By shifting the compute budget to the inference phase, we can unlock higher reasoning capabilities without needing a trillion more parameters.

However, there are practical trade-offs:

Latency: This is the biggest hurdle. A model that "thinks" for 30 seconds is useless for a real-time chatbot but perfect for a coding assistant or a research agent.
Cost: More tokens generated during the "thinking" phase means higher GPU costs per request.
Verification Collapse: If the verifier is as flawed as the generator, the model may confidently double down on a wrong answer.

The lesson here is that "intelligence" in AI isn't just about the weights of the model; it's about the process the model follows to arrive at a conclusion.

Optimize for the cost of correctness. Not every query needs a deep reasoning chain; the goal is to build systems that can dynamically decide how much compute a specific problem deserves.
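One possible shape for that dynamic decision is a small router in front of the model. The sketch below is an assumption-heavy illustration: the keyword heuristic, the budget tiers, and every name in it are hypothetical, and a production system would more likely use a trained classifier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    samples: int              # number of candidate answers to generate
    max_thinking_tokens: int  # cap on intermediate reasoning tokens

# Hypothetical tiers; the numbers are placeholders.
BUDGETS = {
    "low": Budget(samples=1, max_thinking_tokens=0),
    "medium": Budget(samples=1, max_thinking_tokens=1024),
    "high": Budget(samples=4, max_thinking_tokens=8192),
}

def estimate_difficulty(prompt):
    """Toy heuristic, not a real classifier: design-style or long prompts count as harder."""
    hard_markers = ("architect", "prove", "debug", "design")
    text = prompt.lower()
    if any(marker in text for marker in hard_markers):
        return "high"
    if len(text.split()) > 12:
        return "medium"
    return "low"

def choose_budget(prompt):
    return BUDGETS[estimate_difficulty(prompt)]

print(choose_budget("What is 2+2?").samples)                                # 1
print(choose_budget("How do I architect a distributed database?").samples)  # 4
```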

Probabilistic Logic, Deterministic Damage: The Risk of AI Agents

Dhruv Aggarwal — Fri, 05 Jun 2026 12:39:18 +0000

We used to treat LLM "hallucinations" as a quirky byproduct of probabilistic models—wrong answers that were mostly harmless. But as we move from chatbots to Agentic AI (LLM + Tools), the stakes have changed.

When an agent has the autonomy to execute code or call APIs, a "hallucination" is no longer just a wrong sentence; it’s a production incident. The logic is probabilistic, but the damage—burnt API budgets, corrupted databases, or unsolicited emails—is deterministic.

In my experience, building reliable agents isn't about finding a "smarter" model, but about implementing rigorous engineering discipline around the agent's autonomy.

Common failure patterns and their countermeasures:

The Infinite Loop: The agent repeats the same failed action (e.g., searching for a missing document with five slight variations of the same query), burning tokens and time.
- The Fix: Implement strict max_retries and state tracking to detect when the agent is no longer making progress.
The Imaginary API: The agent creates a plausible plan to "book a flight" despite having no access to a travel API. It simulates success because that's what the training data suggests a "helpful assistant" does.
- The Fix: Explicitly define tool constraints in the system prompt. Use a "Verifier" agent or a Human-in-the-Loop (HITL) to validate the plan before execution.
The God-Mode Tool: Giving an agent a tool with DELETE or UPDATE permissions on a production DB. One misinterpreted prompt can wipe a table.
- The Fix: Apply the Principle of Least Privilege (PoLP). Use read-only replicas for data retrieval and enforce a manual approval layer for any high-stakes write operations.
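For the Infinite Loop pattern above, here is a minimal sketch of a retry cap combined with simple progress tracking. The `normalize` rule (case-insensitive, order-insensitive word match) is just one assumed way to treat "slight variations of the same query" as the same action; all names are illustrative.

```python
class NoProgressError(RuntimeError):
    pass

def normalize(action):
    """Treat reordered or re-cased variants of a query as the same action."""
    return " ".join(sorted(action.lower().split()))

def run_with_guard(step, max_retries=5, max_repeats=2):
    """Call step(attempt) -> (action, result) until result is not None.

    Stops early if the same normalized action fails max_repeats times,
    and stops unconditionally after max_retries attempts.
    """
    seen = {}
    for attempt in range(max_retries):
        action, result = step(attempt)
        if result is not None:
            return result
        key = normalize(action)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] >= max_repeats:
            raise NoProgressError(f"repeated failed action: {action!r}")
    raise NoProgressError(f"gave up after {max_retries} attempts")

def stuck_agent(attempt):
    # Simulates an agent rephrasing the same failed search.
    queries = ["find Q3 report", "Q3 report find", "report find q3"]
    return queries[attempt % len(queries)], None

try:
    run_with_guard(stuck_agent)
except NoProgressError as err:
    print("stopped:", err)
```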

Reliability in AI doesn't come from the model's "intelligence," but from the constraints we wrap around it. By enforcing the least amount of autonomy required to complete the task, we shift the system from "unpredictable" to "managed."
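As one example of such a constraint, here is a sketch of a gate placed in front of a SQL tool: reads pass, writes need a named human approver, and everything else is refused. Checking the first keyword is a deliberate simplification, not a SQL parser, and the names are hypothetical. Real enforcement should also live in the database's own permissions.

```python
READ_ONLY = {"SELECT"}
NEEDS_APPROVAL = {"INSERT", "UPDATE", "DELETE"}

class ToolDenied(PermissionError):
    pass

def gate_sql(statement, approved_by=None):
    """Return the statement if it may run; raise ToolDenied otherwise."""
    words = statement.strip().split()
    verb = words[0].upper() if words else ""
    if verb in READ_ONLY:
        return statement
    if verb in NEEDS_APPROVAL and approved_by:
        return statement  # a human signed off on this write
    raise ToolDenied(f"{verb or 'empty statement'} blocked")

print(gate_sql("SELECT id FROM users"))
print(gate_sql("DELETE FROM users WHERE id = 7", approved_by="oncall"))
try:
    gate_sql("DELETE FROM users")
except ToolDenied as err:
    print("denied:", err)  # denied: DELETE blocked
```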

Reliable systems are built on discipline, not prompts.

How are you balancing autonomy with safety? Are you relying on prompt-based constraints, or have you implemented a hard-coded verification layer?

Moving Beyond the Context Window: The Agentic Memory Architecture

Dhruv Aggarwal — Sun, 31 May 2026 12:42:10 +0000

I’ve spent a lot of time lately thinking about why some LLM agents feel "intelligent" while others just feel like chatbots with a slightly better prompt. It almost always comes down to how the system handles memory.

When we treat the context window as the only place for state, we hit a ceiling very quickly. To build an actual agent, we have to move away from "one big prompt" and toward a layered memory architecture.

Agentic memory can be divided into four layers by function:

Working Memory: The current context window. It's our RAM—fast, essential, but wiped clean after every session.
Semantic Memory: The Vector DB or knowledge base. This is where the "world rules" and global conventions live. It’s the reference manual the agent checks to stay aligned.
Procedural Memory: The "how-to" layer. Instead of stuffing every tool description into the prompt, the agent maintains a lean index of skills and pulls in the full implementation only when a specific task triggers it. This keeps the context window clean.
Episodic Memory: This is the hardest part. It's the ability to distill a past interaction into a reusable insight. The real engineering challenge here isn't storage—it's the "forgetting" logic. Deciding what is noise and what is a core pattern is where most frameworks still struggle.
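The procedural layer described above (a lean index in the prompt, full instructions loaded on demand) could look roughly like this. `SkillRegistry` and the example skills are hypothetical names used only for illustration.

```python
class SkillRegistry:
    def __init__(self):
        self._summaries = {}  # name -> one-line description (always in the prompt)
        self._loaders = {}    # name -> function returning the full instructions
        self._loaded = {}     # cache of full instructions already pulled in

    def register(self, name, summary, loader):
        self._summaries[name] = summary
        self._loaders[name] = loader

    def index(self):
        """The lean text that goes into the context window."""
        return "\n".join(f"{n}: {s}" for n, s in sorted(self._summaries.items()))

    def load(self, name):
        """Pull in the full skill only when a task actually triggers it."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

load_calls = []

def load_refund_skill():
    load_calls.append("refund")
    return "Step 1: verify the order. Step 2: check the refund window. Step 3: issue the refund."

registry = SkillRegistry()
registry.register("refund", "Process a customer refund", load_refund_skill)
registry.register("escalate", "Hand off to a human", lambda: "Page the on-call lead.")

print(registry.index())   # two short lines, regardless of how long the skills are
registry.load("refund")
registry.load("refund")   # served from cache; the loader ran once
```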

Depending on the use case, the architecture changes:

Reflex Agents: Just Working Memory.
Support Agents: Working + Procedural.
Coding Agents: The full stack.

The gap between a demo and a production-ready agent is usually the distance between simple RAG and a functioning episodic memory. The ability to compress experience into a usable state is still a significant hurdle.
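There is no settled answer for the "forgetting" logic, but a common starting point is a score that rewards reinforcement and decays with age. The sketch below is one assumed heuristic, not a recommendation: the 30-day half-life and the sample memories are invented for illustration.

```python
import math

def retention_score(hits, age_days, half_life_days=30.0):
    """Reinforcement count discounted by exponential decay with age."""
    return hits * math.pow(0.5, age_days / half_life_days)

def prune(memories, keep):
    """Keep the `keep` highest-scoring memories; each memory is (text, hits, age_days)."""
    ranked = sorted(memories, key=lambda m: retention_score(m[1], m[2]), reverse=True)
    return ranked[:keep]

memories = [
    ("user prefers pytest", 6, 60),        # old, but reinforced often
    ("typo in one commit message", 1, 2),  # recent, seen once: likely noise
    ("repo uses tabs", 3, 10),
]
kept = prune(memories, keep=2)
print([text for text, _, _ in kept])  # ['repo uses tabs', 'user prefers pytest']
```

A one-off observation gets dropped even though it is the most recent, while a pattern that keeps recurring survives.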

Which of these layers are you currently implementing, and how are you handling the "forgetting" logic in your episodic memory?

Beyond the Demo: Operationalizing AI Agents

Dhruv Aggarwal — Sun, 24 May 2026 12:32:48 +0000

Moving an agentic system from a local demo to a production environment is where most projects fail. "Vibe-checking" outputs doesn't scale. To build a reliable system, you need a rigorous operational framework—AgentOps—to move from unpredictable behavior to deterministic reliability.

If you cannot measure the agent's decision path, you cannot debug it. If you cannot quantify the failure rate, you cannot improve it.

I break AgentOps down into three critical layers:

Observability (The "What happened?"): Focus on the causal chain of decisions. Logs aren't enough; you need full traces.

End-to-End Trace Duration: Measuring the delta between user input and final output to identify latency bottlenecks.
Agent-to-Agent Handoff Latency: In multi-agent architectures, quantifying the overhead of control transfers.
Unit Cost per Request: Tracking token spend per successful task to ensure economic viability.
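A minimal sketch of how these three numbers might be captured, assuming sequential (non-nested) spans and a flat token price. The `Trace` class is a hypothetical stand-in for a real tracing library.

```python
import time
from contextlib import contextmanager

class Trace:
    def __init__(self):
        self.spans = []  # (name, duration_seconds, tokens)

    @contextmanager
    def span(self, name, tokens=0):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, time.perf_counter() - start, tokens))

    def total_duration(self):
        # Valid as end-to-end time only because spans here run one after another.
        return sum(d for _, d, _ in self.spans)

    def total_tokens(self):
        return sum(t for _, _, t in self.spans)

def cost_per_successful_task(traces_with_outcome, price_per_1k_tokens):
    """traces_with_outcome: list of (Trace, succeeded). Failed runs still cost money."""
    tokens = sum(t.total_tokens() for t, _ in traces_with_outcome)
    successes = sum(1 for _, ok in traces_with_outcome if ok)
    if successes == 0:
        return float("inf")
    return tokens / 1000 * price_per_1k_tokens / successes

t1, t2 = Trace(), Trace()
with t1.span("plan", tokens=1000):
    pass
with t1.span("handoff", tokens=500):  # agent-to-agent control transfer
    pass
with t2.span("plan", tokens=500):
    pass

unit_cost = cost_per_successful_task([(t1, True), (t2, False)], price_per_1k_tokens=2.0)
print(unit_cost)  # 4.0
```

Dividing by successful tasks rather than all requests is the point: the failed run's 500 tokens are charged to the one task that succeeded.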

Evaluation (The "How well did it work?"): Shifting from qualitative anecdotes to quantitative benchmarks.

Task Completion Rate (TCR): The percentage of requests that reach a successful terminal state.
Violation Rate: Frequency of guardrail breaches (e.g., executing unsafe code, leaking PII, or providing prohibited advice).
Hallucination Rate: Measuring the grounding of responses against a gold-standard dataset or retrieved context.
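Once runs are logged in a consistent shape, all three rates reduce to the same counting exercise. The record fields below (`status`, `violations`, `grounded`) are an assumed schema, and the `grounded` flag is presumed to come from a separate grounding check.

```python
def rate(records, predicate):
    """Fraction of records for which predicate is true; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)

# Hypothetical evaluation log.
runs = [
    {"status": "success", "violations": [], "grounded": True},
    {"status": "success", "violations": ["pii_leak"], "grounded": True},
    {"status": "failed", "violations": [], "grounded": False},
    {"status": "success", "violations": [], "grounded": True},
]

task_completion_rate = rate(runs, lambda r: r["status"] == "success")
violation_rate = rate(runs, lambda r: len(r["violations"]) > 0)
hallucination_rate = rate(runs, lambda r: not r["grounded"])

print(task_completion_rate, violation_rate, hallucination_rate)  # 0.75 0.25 0.25
```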

Optimization (The "How do we make it better?"): Using data from the first two layers to refine the system.

Token Efficiency: Optimizing the prompt-to-output ratio without degrading quality.
Retrieval Precision @K: Refining the RAG pipeline to ensure the top-K retrieved documents are actually relevant.
Handoff Success Rate: Ensuring context is preserved perfectly when shifting from one specialized agent to another.
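Retrieval Precision @K is the easiest of these to pin down: of the top K documents returned, what fraction is actually relevant? A small sketch with made-up document ids:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top-k retrieved documents that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes returning fewer than k documents.
    return hits / k

retrieved = ["d3", "d7", "d1", "d9"]  # ranked output of the retriever
relevant = {"d1", "d3", "d4"}         # labeled ground truth

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
```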

Reliability in AI agents isn't a feature; it's an infrastructure challenge.

Which of these three layers—Observability, Evaluation, or Optimization—is currently your biggest blind spot?

Architecting the Agent OS

Dhruv Aggarwal — Sat, 16 May 2026 05:53:31 +0000

Deploying autonomous agents without a management layer is a significant reliability risk. While an LLM provides the "intelligence," it lacks the operational constraints required for production. Without an orchestration layer—an "Agent OS"—you are essentially running unconstrained code with access to your critical infrastructure.

To move beyond unpredictable prototypes, we need to treat Agent orchestration as a systems design problem. A robust Agent OS must implement these six primitives:

Scheduler & Orchestrator: Manages task prioritization and resource allocation to prevent race conditions and ensure high-priority tasks aren't starved by runaway recursive loops.
Memory Manager: Solves the context window limitation by bridging Short-Term Memory (current session state) with Long-Term Memory (vector databases/RAG) to prevent repetitive loops and state loss.
Tool Manager: Implements a secure execution layer. Instead of granting direct API access, it provides a sandboxed environment (e.g., isolated containers) to prevent catastrophic failures like accidental database drops.
Identity Manager: Enforces the Principle of Least Privilege (PoLP) using ephemeral tokens and certificates. This ensures that an agent's identity is scoped to a specific task and expires immediately after execution.
Observability: Provides deterministic tracing for non-deterministic outputs. Every decision, tool call, and state change must be logged to allow for post-mortem debugging and auditing.
Guardrails & Governance: A dual-layer defense. Technical guardrails filter malicious injections and profane outputs, while governance frameworks enforce "Human-in-the-Loop" (HITL) triggers for high-stakes mutations.
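To make one of these primitives concrete, here is a sketch of the Identity Manager idea: a token that carries only the scopes a task needs and stops working after a short time-to-live. It is an in-memory illustration with hypothetical names, not a substitute for a real secrets or certificate system.

```python
import secrets
import time

class IdentityManager:
    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._tokens = {}    # token -> (scopes, expires_at)

    def issue(self, scopes, ttl_seconds):
        """Mint a random token limited to the given scopes and lifetime."""
        token = secrets.token_urlsafe(16)
        self._tokens[token] = (frozenset(scopes), self._clock() + ttl_seconds)
        return token

    def authorize(self, token, scope):
        entry = self._tokens.get(token)
        if entry is None:
            return False
        scopes, expires_at = entry
        if self._clock() >= expires_at:
            del self._tokens[token]  # expired tokens are forgotten
            return False
        return scope in scopes

    def revoke(self, token):
        """Call when the task finishes, without waiting for expiry."""
        self._tokens.pop(token, None)

now = [0.0]  # fake clock for the demonstration
manager = IdentityManager(clock=lambda: now[0])
token = manager.issue({"tickets:read"}, ttl_seconds=60)

print(manager.authorize(token, "tickets:read"))   # True
print(manager.authorize(token, "tickets:write"))  # False (scope never granted)
now[0] = 61.0
print(manager.authorize(token, "tickets:read"))   # False (expired)
```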

The goal is to shift the paradigm from "hope it works" to a system defined by predictability, security, and trust.

For those of you moving agents into production: Which of these layers is currently your biggest point of failure—memory persistence or secure tool execution?

Why your infra is the silent bottleneck in your AI systems

Dhruv Aggarwal — Fri, 08 May 2026 11:00:40 +0000

Getting high-quality responses from an LLM is rarely a model problem; it is almost always an infrastructure problem.

Frontier models have the reasoning capabilities, but they are limited by the quality and accessibility of the context they are given. This is where Context Engineering—the intersection of RAG and Prompt Engineering—becomes the critical path.

The challenge is that enterprise context is fragmented. It's spread across DBs, SaaS platforms, and on-prem systems, varying between structured and unstructured, and heavily guarded by RBAC.

To solve the context bottleneck, I view the architecture through four pillars:

Connected Access: Use zero-copy federation. Access data where it lives rather than creating duplicate copies. This gives the LLM immediate visibility into the data.
Knowledge Layer: Implement entity resolution and institutional knowledge mapping on top of raw data to provide actual meaning.
Precision Retrieval: Prioritize data by intent, role, and policy. More context does not equal more knowledge; precision ensures relevancy.
Runtime Governance: Apply dynamic checks to determine if a specific data source should be queried based on the user's permissions. This makes the system defensible.
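The Runtime Governance pillar can be sketched as a permission check that runs before any source is queried, so restricted data never reaches the prompt in the first place. The source names, roles, and the `search` callback below are all hypothetical.

```python
# Hypothetical policy table: which roles may query which source.
SOURCES = {
    "hr_db": {"allowed_roles": {"hr"}},
    "wiki": {"allowed_roles": {"hr", "engineer", "sales"}},
    "crm": {"allowed_roles": {"sales"}},
}

def permitted_sources(user_roles, sources=SOURCES):
    """Names of the sources that at least one of the user's roles may access."""
    roles = set(user_roles)
    return sorted(name for name, policy in sources.items() if policy["allowed_roles"] & roles)

def retrieve(query, user_roles, search):
    """Query only permitted sources; `search(source, query)` is supplied by the caller."""
    results = []
    for source in permitted_sources(user_roles):
        results.extend(search(source, query))
    return results

queried = []

def fake_search(source, query):
    queried.append(source)
    return [f"{source}: result for {query!r}"]

print(retrieve("parental leave policy", {"engineer"}, fake_search))
print(queried)  # ['wiki'] (hr_db and crm were never touched)
```

Filtering before retrieval, rather than redacting afterwards, is what makes the decision auditable: the log shows which sources were queried for which user.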

Ultimately, an AI system is only as effective as the context it can retrieve.

How are you handling context retrieval and RBAC in your current AI pipelines?