<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siddhant Khare</title>
    <description>The latest articles on DEV Community by Siddhant Khare (@siddhantkcode).</description>
    <link>https://dev.to/siddhantkcode</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F275629%2Fffa9b0d7-4a34-4dc0-bcf3-ab55c9b5819c.jpeg</url>
      <title>DEV Community: Siddhant Khare</title>
      <link>https://dev.to/siddhantkcode</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siddhantkcode"/>
    <language>en</language>
    <item>
      <title>The AI Agent Stack You Need: Context, Auth, and Cognitive Debt</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Sat, 28 Mar 2026 05:27:53 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/ai-agent-stack-you-need-context-auth-and-cognitive-debt-3l03</link>
      <guid>https://dev.to/siddhantkcode/ai-agent-stack-you-need-context-auth-and-cognitive-debt-3l03</guid>
      <description>&lt;p&gt;Most AI content teaches you how to write prompts.&lt;/p&gt;

&lt;p&gt;This is not that.&lt;/p&gt;

&lt;p&gt;I've spent three years at &lt;a href="https://ona.com/" rel="noopener noreferrer"&gt;Ona&lt;/a&gt; building platform infrastructure for 1.7 million developers. I'm the first independent maintainer of &lt;a href="https://openfga.dev/" rel="noopener noreferrer"&gt;OpenFGA&lt;/a&gt;, the CNCF authorization system based on Google's Zanzibar paper. I built Distill, a context deduplication library that cuts token usage by 30-40% in 12ms. I wrote an essay on AI fatigue that hit #1 on Hacker News, got covered by Business Insider, Futurism, and The New York Times, and was cited by the Hard Fork podcast.&lt;/p&gt;

&lt;p&gt;I wrote down everything I learned from that work. The result is the &lt;a href="http://agents.siddhantkhare.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Agentic Engineering Guide&lt;/strong&gt;&lt;/a&gt;: 216 pages, 33 chapters, covering the full stack from context engineering to agent governance.&lt;/p&gt;

&lt;p&gt;But before you decide whether to read it, let me give you the most useful parts for free.&lt;/p&gt;




&lt;h2&gt;
  
  
  The thing that breaks first
&lt;/h2&gt;

&lt;p&gt;When teams move from Level 2 (chat agents) to Level 3 (agents that actually execute code, call APIs, write files), the first thing that breaks is not the model. It's authorization.&lt;/p&gt;

&lt;p&gt;Your agent has access to your database. Your secrets. Your production environment. What permission model are you using?&lt;/p&gt;

&lt;p&gt;Most teams answer: "the same one as the developer who set it up."&lt;/p&gt;

&lt;p&gt;That's the wrong answer. A developer has permissions scoped to their identity and their judgment. An agent has permissions scoped to... whatever you gave it, running autonomously, at 2am, without anyone watching.&lt;/p&gt;

&lt;p&gt;The guide covers Zanzibar-based authorization for agents, the Rule of Two (no agent action should be irreversible without a second check), and why most MCP deployments have a security gap that teams don't discover until something goes wrong.&lt;/p&gt;
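&lt;p&gt;To make the Zanzibar idea concrete: authorization is stored as relationship tuples ("user U has relation R on object O") and checked before every action. Here is a minimal Go sketch with hypothetical names (&lt;code&gt;Store&lt;/code&gt;, &lt;code&gt;Check&lt;/code&gt;); the real OpenFGA model adds usersets and rewrite rules on top of this:&lt;/p&gt;

```go
package main

import "fmt"

// A Zanzibar-style relationship tuple: object#relation@user.
// Illustrative only; not the OpenFGA API.
type Tuple struct {
	Object   string // e.g. "db:staging"
	Relation string // e.g. "writer"
	User     string // e.g. "agent:deploy-bot"
}

type Store struct{ tuples map[Tuple]bool }

func NewStore() *Store {
	s := new(Store)
	s.tuples = map[Tuple]bool{}
	return s
}

func (s *Store) Write(t Tuple) { s.tuples[t] = true }

// Check asks: does user hold relation on object?
func (s *Store) Check(object, relation, user string) bool {
	return s.tuples[Tuple{object, relation, user}]
}

func main() {
	s := NewStore()
	// The agent gets its own narrow grant, not the developer's identity.
	s.Write(Tuple{"db:staging", "writer", "agent:deploy-bot"})

	fmt.Println(s.Check("db:staging", "writer", "agent:deploy-bot")) // true
	fmt.Println(s.Check("db:prod", "writer", "agent:deploy-bot"))    // false
}
```

&lt;p&gt;The shape matters more than the code: the agent's permissions become explicit data you can audit, not an identity it inherited from whoever set it up.&lt;/p&gt;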




&lt;h2&gt;
  
  
  The 30-40% problem
&lt;/h2&gt;

&lt;p&gt;Here's a number that should concern you: 30-40% of the context you send to your LLM is redundant.&lt;/p&gt;

&lt;p&gt;Your documentation says the same thing as your code comments. Your FAQ overlaps with your support tickets. Your API docs repeat what's in your tutorials. The LLM sees the same fact five different ways and gets confused. Same input, different output. Every time.&lt;/p&gt;

&lt;p&gt;The instinct is to fix the prompt. It doesn't work. You cannot prompt your way out of garbage context.&lt;/p&gt;

&lt;p&gt;The fix is upstream. Context engineering is the discipline of cleaning, deduplicating, compressing, and structuring the information before it reaches the model. The guide covers the 4-layer context stack, the meta-MCP pattern that cuts token usage by 88%, and why deterministic preprocessing beats LLM-based compression every time.&lt;/p&gt;
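&lt;p&gt;A deterministic first pass is cheap. As an illustration (not Distill's actual algorithm), normalizing text and hashing it drops trivially reworded repeats before anything reaches the model:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// normalize lowercases and collapses whitespace so trivially
// different copies of the same text hash identically.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// Dedup keeps the first occurrence of each normalized chunk.
func Dedup(chunks []string) []string {
	seen := map[[32]byte]bool{}
	var out []string
	for _, c := range chunks {
		key := sha256.Sum256([]byte(normalize(c)))
		if !seen[key] {
			seen[key] = true
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chunks := []string{
		"Reset your password via  the login page.",
		"reset your password via the login page.",
		"Enable 2FA in settings.",
	}
	fmt.Println(len(Dedup(chunks))) // 2
}
```

&lt;p&gt;Near-duplicates with genuinely different wording need embedding-based clustering on top, but a pass like this is fast, reproducible, and costs no LLM calls.&lt;/p&gt;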




&lt;h2&gt;
  
  
  What 300 engineers told me about AI fatigue
&lt;/h2&gt;

&lt;p&gt;In late 2025, I published a post about AI fatigue in engineering teams. It hit #1 on Hacker News. The comments were more useful than the post.&lt;/p&gt;

&lt;p&gt;The pattern that emerged: teams that adopted AI tools without changing their workflows burned out faster than teams that didn't adopt AI at all. The tools added cognitive load without removing it. Engineers were reviewing AI output on top of writing their own code, not instead of it.&lt;/p&gt;

&lt;p&gt;The teams that succeeded did something different. They treated agent adoption as an organizational change problem, not a technology problem. They changed review processes, changed how they measured productivity, changed what they expected from junior engineers. The technology was the easy part.&lt;/p&gt;

&lt;p&gt;Chapter 20 of the guide covers the AI fatigue patterns in detail. Chapter 21 covers the Conductor Model: the workflow that lets engineers direct agents without becoming agents themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  The maturity model
&lt;/h2&gt;

&lt;p&gt;Where does your team fall?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Experimental.&lt;/strong&gt; Individual developers using Copilot or Claude. No team policies. No shared context. No measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Structured.&lt;/strong&gt; Team has agreed on which tools to use and when. Basic review policies. Some measurement of output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Integrated.&lt;/strong&gt; Agents in the CI/CD pipeline. Automated quality gates. Cost tracking. Incident response procedures for when agents break things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Orchestrated.&lt;/strong&gt; Agents run autonomously on task queues. Multi-agent systems with defined handoffs. Human oversight at the decision level, not the execution level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 5: Autonomous.&lt;/strong&gt; Agents operate 24/7. Background agents monitor repositories, fix issues, generate tests, update documentation. Humans set goals and review outcomes.&lt;/p&gt;

&lt;p&gt;Most teams in early 2026 are at Level 2. The transition to Level 3 is where the engineering discipline becomes essential. The transition to Level 4 is where it becomes critical.&lt;/p&gt;

&lt;p&gt;The guide has a full maturity assessment with specific practices for each level and a roadmap for moving between them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cognitive debt problem
&lt;/h2&gt;

&lt;p&gt;Technical debt is code that works but is hard to maintain.&lt;/p&gt;

&lt;p&gt;Cognitive debt is code that works but nobody understands.&lt;/p&gt;

&lt;p&gt;At Ona, 88.5% of merged PRs are agent-authored. That's not a boast. It's a warning. When AI writes most of your code, the team's mental model of the codebase degrades. Engineers can review individual PRs without understanding the system those PRs are building. The code is correct. The understanding is gone.&lt;/p&gt;

&lt;p&gt;This is more dangerous than technical debt. You can pay down technical debt by refactoring. You pay down cognitive debt by reading code you didn't write, understanding systems you didn't design, and rebuilding mental models that were never formed in the first place.&lt;/p&gt;

&lt;p&gt;The guide covers three practices for managing cognitive debt: mandatory architecture reviews before agent-authored features ship, "explain this to me" sessions where engineers walk through agent-authored code without looking at the diff, and rotation policies that ensure every engineer touches every part of the codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's in the guide
&lt;/h2&gt;

&lt;p&gt;33 chapters across 10 parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Foundations:&lt;/strong&gt; What agents are, what they can do, the capability spectrum from Level 1 to Level 4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Engineering:&lt;/strong&gt; The 4-layer stack, RAG vs. agentic search, token economics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Authorization:&lt;/strong&gt; The agent threat model, Zanzibar for agents, prompt injection, sandboxing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocols &amp;amp; Standards:&lt;/strong&gt; MCP in production, A2A communication, AGENTS.md&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; OpenTelemetry for agents, cost tracking, incident response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; The agent loop, multi-agent systems, memory and checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Practices:&lt;/strong&gt; AI fatigue, the Conductor Model, the maturity model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Workflows:&lt;/strong&gt; Your first agent in production, security checklists, measuring impact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Engineering:&lt;/strong&gt; Evaluation, enterprise adoption, FinOps, governance, model routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Adoption Playbook:&lt;/strong&gt; A step-by-step guide for taking a team from Level 1 to Level 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus four appendices: tool directory, glossary, further reading, and templates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who it's for
&lt;/h2&gt;

&lt;p&gt;Engineering leaders, senior engineers, and platform architects who are adopting AI agents or deciding whether to.&lt;/p&gt;

&lt;p&gt;You should be comfortable with software engineering concepts (distributed systems, API design, CI/CD, observability). You don't need prior experience with AI or machine learning.&lt;/p&gt;

&lt;p&gt;This is not a coding tutorial. Not a vendor comparison. Not a prompt engineering guide. It's a book about engineering judgment in the age of AI agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get it
&lt;/h2&gt;

&lt;p&gt;The full guide is free to read at &lt;a href="https://agents.siddhantkhare.com" rel="noopener noreferrer"&gt;agents.siddhantkhare.com&lt;/a&gt; and open source on &lt;a href="https://github.com/Siddhant-K-code/agentic-engineering-guide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want the PDF or EPUB to read offline, it's on Gumroad at pay-what-you-want (minimum $11). All future updates included.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://agents.siddhantkhare.com" rel="noopener noreferrer"&gt;Read free online →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://siddhantkhar5.gumroad.com/l/agentic-engineering-guide" rel="noopener noreferrer"&gt;Get the PDF / EPUB →&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions? I'm &lt;a href="https://twitter.com/Siddhant_K_code" rel="noopener noreferrer"&gt;@Siddhant_K_code&lt;/a&gt; on X or &lt;a href="https://linkedin.com/in/siddhantkhare24" rel="noopener noreferrer"&gt;Siddhant Khare&lt;/a&gt; on LinkedIn. Drop a comment below if you want me to go deeper on any of these topics.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>Ona (formerly Gitpod) is re-launching its Open Source program</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Tue, 03 Feb 2026 14:36:58 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/ona-formerly-gitpod-is-re-launching-its-open-source-program-5d3c</link>
      <guid>https://dev.to/siddhantkcode/ona-formerly-gitpod-is-re-launching-its-open-source-program-5d3c</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/gitpod-io/gitpod" rel="noopener noreferrer"&gt;Gitpod&lt;/a&gt; started as an open-source project. Long before “AI productivity” became a thing, the core problem we were trying to solve was simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;help contributors get productive without wasting maintainer time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the years, working closely with open-source maintainers using Gitpod’s Open Source plan, the same issues kept coming up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR backlogs grow faster than they can be reviewed&lt;/li&gt;
&lt;li&gt;Maintainers spend large amounts of time onboarding contributors and answering setup questions&lt;/li&gt;
&lt;li&gt;Reviewing changes often means reconstructing context instead of focusing on intent and correctness&lt;/li&gt;
&lt;li&gt;And now, with AI tools everywhere, maintainers also have to sift through a growing volume of low-signal or poorly contextualized PRs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recently, Gitpod evolved into &lt;a href="https://ona.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Ona&lt;/strong&gt;&lt;/a&gt;. The product has grown, but the maintainer problems haven’t gone away.&lt;/p&gt;

&lt;p&gt;That’s why we’ve brought back the &lt;strong&gt;Open Source plan&lt;/strong&gt;, now as the &lt;strong&gt;Ona for Open Source program&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the focus?
&lt;/h3&gt;

&lt;p&gt;This isn’t about adding more tools. It’s about reducing friction where it hurts most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ona for Open Source&lt;/strong&gt; is designed to help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maintainers&lt;/strong&gt; review PRs faster by spending less time reconstructing context and unblocking contributors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects&lt;/strong&gt; keep backlogs manageable as contribution volume increases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributors&lt;/strong&gt; start working with clearer expectations and fewer setup-related questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams&lt;/strong&gt; keep signal high even as AI-assisted contributions become more common&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re curious about the transition from Gitpod to Ona, here’s more context: &lt;a href="https://ona.com/stories/gitpod-is-now-ona" rel="noopener noreferrer"&gt;https://ona.com/stories/gitpod-is-now-ona&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you maintain (or contribute to) an open-source project and want to check out the program: &lt;strong&gt;&lt;a href="https://ona.com/open-source" rel="noopener noreferrer"&gt;https://ona.com/open-source&lt;/a&gt;&lt;/strong&gt;. You can get up to $200/month in free credits. &lt;/p&gt;

&lt;p&gt;Open source survives because maintainers keep showing up. If we can reduce even a small part of that load, especially in an AI-heavy world, it’s worth doing.&lt;/p&gt;

&lt;p&gt;Happy to hear feedback, particularly from maintainers on what still feels broken.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>ona</category>
      <category>agents</category>
    </item>
    <item>
      <title>Containers aren’t a sandbox for AI agents</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Sat, 10 Jan 2026 18:04:07 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/containers-arent-a-sandbox-for-ai-agents-215o</link>
      <guid>https://dev.to/siddhantkcode/containers-arent-a-sandbox-for-ai-agents-215o</guid>
      <description>&lt;h3&gt;
  
  
  Where containers stop being simple
&lt;/h3&gt;

&lt;p&gt;Containers are sold as a solved abstraction. You package a filesystem, declare a process, and the world becomes reproducible. That story is mostly true - until the moment you ask the container to do something that leaks across the kernel boundary.&lt;/p&gt;

&lt;p&gt;That moment is usually accidental.&lt;/p&gt;

&lt;p&gt;You start by “just adding a dependency.” Maybe a browser for tooling. Maybe an emulator. Maybe a sandbox that needs stronger isolation. The Dockerfile grows a few lines. Everything still builds. Tests still pass. And then, quietly, you hit the edge of what containers can actually promise.&lt;/p&gt;

&lt;p&gt;I hit that edge while working on a containerized IDE environment - one that wasn’t just compiling code, but running a full graphical toolchain and emulators inside a browser-accessible container. On paper, it was still “just Docker.” In practice, it forced a confrontation with an uncomfortable truth:&lt;/p&gt;

&lt;p&gt;containers don’t virtualize the kernel; they borrow it.&lt;/p&gt;

&lt;p&gt;Once you internalize that, a lot of container folklore collapses.&lt;/p&gt;




&lt;h3&gt;
  
  
  Userland is easy. The kernel is not
&lt;/h3&gt;

&lt;p&gt;The first category of problems is deceptively straightforward. You want to harden behavior inside the environment - prevent certain protocol handlers, restrict what happens when a user clicks a link, reduce accidental escape hatches. That lives squarely in userland.&lt;/p&gt;

&lt;p&gt;You install packages.&lt;br&gt;
You write config files.&lt;br&gt;
You control defaults.&lt;/p&gt;

&lt;p&gt;This feels like progress because it is progress. It’s policy expressed as files, and containers are excellent at that.&lt;/p&gt;

&lt;p&gt;Then comes the second category of problems, which look similar but are fundamentally different.&lt;/p&gt;

&lt;p&gt;You want acceleration.&lt;br&gt;
You want virtualization.&lt;br&gt;
You want isolation stronger than namespaces.&lt;/p&gt;

&lt;p&gt;So you install QEMU. You add configuration files that reference KVM. You write the incantations that every blog post seems to recommend. The image builds fine.&lt;/p&gt;

&lt;p&gt;And nothing actually changes.&lt;/p&gt;

&lt;p&gt;Because at this point, you are no longer configuring the container. You are attempting to configure the host kernel from inside a process that does not own it.&lt;/p&gt;

&lt;p&gt;No amount of Dockerfile cleverness can cross that boundary.&lt;/p&gt;

&lt;p&gt;Nested virtualization, device access, hardware acceleration - these are not properties of images. They are properties of the execution environment. They depend on CPU flags, kernel modules, hypervisor configuration, and runtime privileges. A container can only benefit from them if the host explicitly allows it to.&lt;/p&gt;

&lt;p&gt;This is the moment many container designs quietly break. Not because the idea was wrong, but because the abstraction was overextended.&lt;/p&gt;
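&lt;p&gt;This boundary is easy to probe. Whether KVM is usable is decided by the host and the runtime flags, not by anything in the image. A small Linux-only sketch:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
)

// kvmAvailable reports whether /dev/kvm exists and can be opened.
// Installing QEMU in the image never makes this true by itself;
// the runtime must pass the device through (e.g. --device=/dev/kvm).
func kvmAvailable() bool {
	f, err := os.OpenFile("/dev/kvm", os.O_RDWR, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

func main() {
	fmt.Println("KVM usable:", kvmAvailable())
}
```

&lt;p&gt;In most containers this prints &lt;code&gt;false&lt;/code&gt; no matter what the Dockerfile installed.&lt;/p&gt;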




&lt;h3&gt;
  
  
  The same boundary shows up in agent systems
&lt;/h3&gt;

&lt;p&gt;This matters far beyond IDEs or emulators.&lt;/p&gt;

&lt;p&gt;Modern AI systems increasingly rely on agents - processes that don’t just think, but act. They run tools. They clone repositories. They install dependencies. They execute arbitrary code. Often concurrently. Often on behalf of users.&lt;/p&gt;

&lt;p&gt;At first glance, containers seem perfect for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One container per agent.&lt;/li&gt;
&lt;li&gt;Clean filesystem.&lt;/li&gt;
&lt;li&gt;Resource limits via cgroups.&lt;/li&gt;
&lt;li&gt;Tear down when done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works - until you care about any of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running untrusted code.&lt;/li&gt;
&lt;li&gt;Preventing lateral movement.&lt;/li&gt;
&lt;li&gt;Controlling outbound network behavior.&lt;/li&gt;
&lt;li&gt;Enforcing strict filesystem policies.&lt;/li&gt;
&lt;li&gt;Supporting Docker-in-Docker–like workflows.&lt;/li&gt;
&lt;li&gt;Providing hardware acceleration safely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, you rediscover the same boundary: containers are not security sandboxes; they are process isolation with a shared kernel.&lt;/p&gt;

&lt;p&gt;If your agent needs to cross into host-level capabilities - starting sibling containers, accessing &lt;code&gt;/dev/kvm&lt;/code&gt;, mounting filesystems, manipulating network namespaces - you are back in the world of privileges, devices, and kernel trust.&lt;/p&gt;

&lt;p&gt;The IDE problem and the agent problem are the same problem wearing different clothes.&lt;/p&gt;




&lt;h3&gt;
  
  
  Strong isolation is not a container problem
&lt;/h3&gt;

&lt;p&gt;There is a recurring mistake in infrastructure design: trying to solve policy problems with packaging tools.&lt;/p&gt;

&lt;p&gt;Containers are packaging plus lightweight isolation. They are fantastic for reproducibility and deployment. They are not a complete security boundary.&lt;/p&gt;

&lt;p&gt;Once you accept that, architecture decisions become clearer.&lt;/p&gt;

&lt;p&gt;If your agents run trusted code, containers may be enough.&lt;/p&gt;

&lt;p&gt;If your agents run untrusted code, containers are probably insufficient.&lt;/p&gt;

&lt;p&gt;That’s when other tools appear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MicroVMs (Firecracker, Kata).&lt;/li&gt;
&lt;li&gt;Sandboxed runtimes (gVisor).&lt;/li&gt;
&lt;li&gt;Ephemeral execution environments.&lt;/li&gt;
&lt;li&gt;Strict syscall filters and egress policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These systems are slower to spin up and harder to operate, but they draw the boundary in the right place: at the kernel interface, not inside it.&lt;/p&gt;

&lt;p&gt;What looks like extra complexity is often just honesty about where isolation actually comes from.&lt;/p&gt;




&lt;h3&gt;
  
  
  The real product is policy
&lt;/h3&gt;

&lt;p&gt;The most important lesson from all of this is subtle:&lt;br&gt;
the hard part is not running code - it’s deciding what that code is allowed to do.&lt;/p&gt;

&lt;p&gt;Opening links.&lt;br&gt;
Accessing the network.&lt;br&gt;
Reading from disk.&lt;br&gt;
Writing artifacts.&lt;br&gt;
Using hardware acceleration.&lt;/p&gt;

&lt;p&gt;Every meaningful system ends up encoding policy, whether explicitly or by accident. Containers make it easy to ship policy as configuration, but they don’t remove the need to reason about it.&lt;/p&gt;

&lt;p&gt;Agent orchestration systems that scale will not be defined by clever prompts or clever scheduling. They will be defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear trust boundaries.&lt;/li&gt;
&lt;li&gt;Explicit execution contracts.&lt;/li&gt;
&lt;li&gt;Reproducible but constrained environments.&lt;/li&gt;
&lt;li&gt;Observability that maps actions back to intent.&lt;/li&gt;
&lt;/ul&gt;
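&lt;p&gt;An "execution contract" can start as plain data. This is a hypothetical shape, not any particular framework's API:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Contract describes what a single agent run is allowed to do.
// Field names are illustrative.
type Contract struct {
	AllowNetwork  bool
	WritePaths    []string // path prefixes the agent may write under
	MaxCPUSeconds int
}

// AllowedWrite checks a requested path against the contract.
func (c Contract) AllowedWrite(path string) bool {
	for _, p := range c.WritePaths {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	c := Contract{WritePaths: []string{"/workspace/"}, MaxCPUSeconds: 60}
	fmt.Println(c.AllowedWrite("/workspace/out.txt")) // true
	fmt.Println(c.AllowedWrite("/etc/passwd"))        // false
}
```

&lt;p&gt;Enforcement still has to live below the process (seccomp, network policy, mounts); the value of the struct is that the policy becomes reviewable and loggable.&lt;/p&gt;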

&lt;p&gt;That’s not an AI problem. That’s an infrastructure problem we’ve been solving for decades - just under different names.&lt;/p&gt;




&lt;h3&gt;
  
  
  Containers are still the right starting point
&lt;/h3&gt;

&lt;p&gt;None of this is an argument against containers.&lt;/p&gt;

&lt;p&gt;Containers are still the best default abstraction we have. They let us experiment cheaply, reason locally, and iterate fast. They are the right place to start.&lt;/p&gt;

&lt;p&gt;But they are not the place to stop.&lt;/p&gt;

&lt;p&gt;Every serious system eventually reaches the point where “just put it in Docker” stops being an answer and starts being a question. When that happens, the mistake is not hitting the limit - it’s pretending the limit isn’t there.&lt;/p&gt;

&lt;p&gt;The moment you need kernel features, hardware guarantees, or hostile-code isolation, the architecture must change.&lt;/p&gt;

&lt;p&gt;The good news is that this boundary is predictable. You can see it coming if you know what to look for.&lt;/p&gt;

&lt;p&gt;The bad news is that you can’t paper over it with a Dockerfile.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on &lt;a href="https://www.linkedin.com/in/siddhantkhare24" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/siddhant_K_code" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://siddhantkhare.com" rel="noopener noreferrer"&gt;Siddhant Khare&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>containers</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Engineering Guide to Context Window Efficiency</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Tue, 23 Dec 2025 08:28:57 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/the-engineering-guide-to-context-window-efficiency-202b</link>
      <guid>https://dev.to/siddhantkcode/the-engineering-guide-to-context-window-efficiency-202b</guid>
      <description>&lt;p&gt;&lt;em&gt;A deep dive into semantic deduplication for LLM context windows&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system—just worded slightly differently.&lt;/p&gt;

&lt;p&gt;This isn't a minor inefficiency. In production RAG systems, &lt;strong&gt;30-40% of retrieved context is semantically redundant&lt;/strong&gt;. That's wasted tokens, higher API costs, and confused model outputs.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://waitlist.siddhantkhare.com/?project=GoVectorSync" rel="noopener noreferrer"&gt;GoVectorSync&lt;/a&gt; to fix this. Here's the technical deep-dive on the problem and solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Semantic Redundancy in Multi-Source RAG
&lt;/h2&gt;

&lt;p&gt;Modern AI agents pull context from multiple sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                        User Query                           │
│                "How do I reset my password?"                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │           Context Sources               │
        ├─────────────────────────────────────────┤
        │  📄 RAG (Documentation)                 │
        │  🔧 MCP Tools (API responses)           │
        │  🧠 Memory (Past conversations)         │
        │  ⚡ Skills (Procedural knowledge)       │
        └─────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Retrieved Chunks                          │
├─────────────────────────────────────────────────────────────┤
│ [RAG]    "To reset your password, click 'Forgot Password'  │
│           on the login page..."                             │
│ [RAG]    "Password reset: Navigate to login, select        │
│           'Forgot Password'..."                             │
│ [MCP]    "User password can be reset via the forgot        │
│           password flow in the auth system"                 │
│ [Memory] "Last time you asked, I explained the password    │
│           reset uses the forgot password link..."           │
│ [RAG]    "Account deletion is available in Settings..."    │
│ [MCP]    "Delete account: Settings &amp;gt; Account &amp;gt; Delete"     │
│ [RAG]    "Set up 2FA for extra security..."                │
│ [Skills] "Contact support at support@example.com"          │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at those first four results. They're all saying the same thing: &lt;em&gt;use the forgot password flow&lt;/em&gt;. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Math of Waste
&lt;/h3&gt;

&lt;p&gt;If you retrieve 8 chunks and 5 are duplicates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;62% of your context window is wasted&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;You're paying for tokens that add no information&lt;/li&gt;
&lt;li&gt;The LLM sees repetition, which can bias its response&lt;/li&gt;
&lt;li&gt;You're missing diverse information that could help&lt;/li&gt;
&lt;/ul&gt;
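&lt;p&gt;The arithmetic behind that figure is blunt:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	retrieved := 8.0
	redundant := 5.0 // chunks repeating information already present
	fmt.Printf("wasted context: %.1f%%\n", 100*redundant/retrieved) // wasted context: 62.5%
}
```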




&lt;h2&gt;
  
  
  Why Cosine Similarity Isn't Enough
&lt;/h2&gt;

&lt;p&gt;You might think: "Just dedupe by cosine similarity threshold."&lt;/p&gt;

&lt;p&gt;The problem is choosing that threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Similarity between chunks about password reset:

"To reset your password, click 'Forgot Password'..."
    vs
"Password reset: Navigate to login, select 'Forgot'..."

Cosine Similarity: 0.82
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is 0.82 a duplicate? What about 0.75? 0.68?&lt;/p&gt;

&lt;p&gt;A fixed threshold fails because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Domain variance&lt;/strong&gt;: Technical docs cluster tighter than conversational text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Length effects&lt;/strong&gt;: Longer chunks have different similarity distributions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model quirks&lt;/strong&gt;: Different models have different similarity ranges&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You need something smarter.&lt;/p&gt;
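&lt;p&gt;The score itself is the easy part. For reference, a minimal cosine similarity over two embedding vectors (toy values, not real embeddings):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	u := []float64{0.9, 0.1, 0.4}
	v := []float64{0.8, 0.2, 0.5}
	fmt.Printf("%.2f\n", cosine(u, v))
}
```

&lt;p&gt;The number is cheap to produce; deciding what it means, across domains, chunk lengths, and embedding models, is the part a fixed threshold gets wrong.&lt;/p&gt;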




&lt;h2&gt;
  
  
  The Solution: Cluster → Select → Diversify
&lt;/h2&gt;

&lt;p&gt;GoVectorSync uses a three-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    GoVectorSync Pipeline                    │
└─────────────────────────────────────────────────────────────┘

     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │   STAGE 1    │     │   STAGE 2    │     │   STAGE 3    │
     │  Over-fetch  │────▶│   Cluster    │────▶│   Select +   │
     │   (3-5x K)   │     │ (Semantic)   │     │     MMR      │
     └──────────────┘     └──────────────┘     └──────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │  50 chunks   │     │  12 clusters │     │   8 chunks   │
     │ from VectorDB│     │  by meaning  │     │   diverse    │
     └──────────────┘     └──────────────┘     └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 1: Over-fetch
&lt;/h3&gt;

&lt;p&gt;Instead of retrieving exactly K chunks, retrieve 3-5x more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Request 50 chunks when you need 8&lt;/span&gt;
&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TopK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OverFetchK&lt;/span&gt;  &lt;span class="c"&gt;// 50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Agglomerative Clustering
&lt;/h3&gt;

&lt;p&gt;We group semantically similar chunks using hierarchical clustering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│              Agglomerative Clustering                       │
└─────────────────────────────────────────────────────────────┘

Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]

Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]

Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]

Stop when distance &amp;gt; threshold (0.15)

Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌────────┐ ┌────────┐
│ Password Reset  │ │ Account Delete  │ │  2FA   │ │Support │
│ C1, C2, C3, C4  │ │    C5, C6       │ │  C7    │ │  C8    │
└─────────────────┘ └─────────────────┘ └────────┘ └────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Clusterer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClusterResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Initialize: each chunk is its own cluster&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clusterNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;clusterNode&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;centroid&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Compute pairwise distance matrix&lt;/span&gt;
    &lt;span class="n"&gt;distMatrix&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;computeDistanceMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Agglomerative merging&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;activeCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Find closest pair&lt;/span&gt;
        &lt;span class="n"&gt;minDist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minJ&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;findClosestPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distMatrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// Stop if distance exceeds threshold&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;minDist&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Threshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Merge clusters&lt;/span&gt;
        &lt;span class="n"&gt;mergeClusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;minI&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;minJ&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;minJ&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;buildResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: We use &lt;em&gt;average linkage&lt;/em&gt; by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Clusterer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;clusterDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clusterNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distMatrix&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linkage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"single"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Min distance - can cause chaining&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;minPairwiseDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distMatrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"complete"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Max distance - very conservative&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;maxPairwiseDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distMatrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"average"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Mean distance - balanced&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;avgPairwiseDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distMatrix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 3: Representative Selection
&lt;/h3&gt;

&lt;p&gt;From each cluster, we pick a single representative. Three strategies are available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                Selection Strategies                         │
└─────────────────────────────────────────────────────────────┘

Strategy: "score" (default)
─────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking

    Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
    Selected: C1 (score 0.92)


Strategy: "centroid"
─────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk

    Cluster centroid: [0.12, -0.34, 0.56, ...]

    C1 distance to centroid: 0.08
    C2 distance to centroid: 0.12  
    C3 distance to centroid: 0.05  ← Selected
    C4 distance to centroid: 0.15


Strategy: "hybrid"
─────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality

    hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SelectFromCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;SelectByScore&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectByScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c"&gt;// Highest retrieval score&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;SelectByCentroid&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectByCentroid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;// Closest to centroid&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;SelectByHybrid&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selectByHybrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c"&gt;// Weighted combination&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
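&lt;p&gt;To make the hybrid strategy concrete, here is a self-contained sketch of what the weighted selection could look like. The &lt;code&gt;chunk&lt;/code&gt; struct, the function signature, and the 0.7/0.3 weighting are illustrative assumptions, not the library's exact API:&lt;/p&gt;

```go
package main

import "fmt"

// chunk is a simplified stand-in for the library's chunk type
// (illustrative only).
type chunk struct {
	ID               string
	Score            float64 // retrieval score from the vector DB
	CentroidDistance float64 // cosine distance to the cluster centroid
}

// selectByHybrid blends normalized retrieval score with proximity
// to the cluster centroid, as in the "hybrid" strategy above.
func selectByHybrid(members []chunk, scoreWeight float64) chunk {
	// Find the max score for normalization.
	maxScore := 0.0
	for _, m := range members {
		if m.Score > maxScore {
			maxScore = m.Score
		}
	}
	best, bestVal := members[0], -1.0
	for _, m := range members {
		norm := 0.0
		if maxScore > 0 {
			norm = m.Score / maxScore
		}
		proximity := 1.0 - m.CentroidDistance // closer to centroid scores higher
		val := scoreWeight*norm + (1-scoreWeight)*proximity
		if val > bestVal {
			best, bestVal = m, val
		}
	}
	return best
}

func main() {
	// The cluster from the diagram above.
	cluster := []chunk{
		{"C1", 0.92, 0.08},
		{"C2", 0.89, 0.12},
		{"C3", 0.85, 0.05},
		{"C4", 0.78, 0.15},
	}
	fmt.Println(selectByHybrid(cluster, 0.7).ID) // prints: C1
}
```

&lt;p&gt;With a 0.7 score weight, relevance dominates and C1 wins; drop the weight toward 0.3 and the centroid-closest C3 starts to compete.&lt;/p&gt;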



&lt;h3&gt;
  
  
  Stage 4: MMR Re-ranking (Optional)
&lt;/h3&gt;

&lt;p&gt;After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│           Maximal Marginal Relevance (MMR)                  │
└─────────────────────────────────────────────────────────────┘

Formula:
    MMR(chunk) = λ × relevance(chunk) 
               - (1-λ) × max_similarity(chunk, already_selected)

Where:
    λ = 0.5 (balanced)
    λ = 1.0 (pure relevance, no diversity)
    λ = 0.0 (pure diversity, ignore relevance)


Example with λ = 0.5:
─────────────────────────────

Candidates: [A, B, C, D, E]  (already selected: none)

Round 1:
    MMR(A) = 0.5 × 0.95 - 0 = 0.475  ← Selected (highest relevance)

Round 2: (A already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
           = 0.45 - 0.5 × 0.85 = 0.025
    MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
           = 0.425 - 0.5 × 0.20 = 0.325  ← Selected (diverse!)

Round 3: (A, C already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MMR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;computeMMRScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidateIdx&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                               &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;simMatrix&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;candidateIdx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lambda&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Find max similarity to any selected chunk&lt;/span&gt;
    &lt;span class="n"&gt;maxSim&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;selIdx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;simMatrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;candidateIdx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;selIdx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;maxSim&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;maxSim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// MMR formula&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lambda&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;maxSim&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
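&lt;p&gt;The scoring function above sits inside a greedy loop: each round re-scores the remaining candidates against everything selected so far and takes the winner. A minimal self-contained sketch of that loop (the function name and plain-slice inputs are assumptions for illustration):&lt;/p&gt;

```go
package main

import "fmt"

// mmrSelect greedily picks k indices: each round adds the candidate
// with the best lambda-weighted relevance-minus-redundancy score.
func mmrSelect(scores []float64, simMatrix [][]float64, k int, lambda float64) []int {
	var selected []int
	remaining := make(map[int]bool)
	for i := range scores {
		remaining[i] = true
	}
	for len(selected) < k && len(remaining) > 0 {
		bestIdx, bestVal := -1, -1e18
		for idx := range remaining {
			val := lambda * scores[idx] // relevance term
			maxSim := 0.0               // redundancy: max similarity to any selected chunk
			for _, s := range selected {
				if simMatrix[idx][s] > maxSim {
					maxSim = simMatrix[idx][s]
				}
			}
			val -= (1 - lambda) * maxSim
			if val > bestVal {
				bestIdx, bestVal = idx, val
			}
		}
		selected = append(selected, bestIdx)
		delete(remaining, bestIdx)
	}
	return selected
}

func main() {
	// A and B are near-duplicates (sim 0.85); C is distinct but less relevant.
	scores := []float64{0.95, 0.90, 0.85} // A, B, C
	sim := [][]float64{
		{1.00, 0.85, 0.20},
		{0.85, 1.00, 0.25},
		{0.20, 0.25, 1.00},
	}
	fmt.Println(mmrSelect(scores, sim, 2, 0.5)) // prints: [0 2]
}
```

&lt;p&gt;This reproduces the worked example: A is picked first on pure relevance, then C beats the near-duplicate B despite B's higher raw score.&lt;/p&gt;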






&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's how it all fits together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                  GoVectorSync Broker                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY                                              │
│    "How do I reset my password?" → [0.12, -0.34, ...]      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB                                │
│    Query Pinecone/Qdrant for top 50 chunks                  │
│    Include embeddings for clustering                        │
│                                                             │
│    Latency: ~15ms                                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING                                 │
│    50 chunks → 15 clusters                                  │
│    Threshold: 0.15 cosine distance                          │
│                                                             │
│    Latency: ~8ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION                                 │
│    15 clusters → 15 representatives                         │
│    Strategy: highest score per cluster                      │
│                                                             │
│    Latency: ~1ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING                                           │
│    15 representatives → 8 diverse chunks                    │
│    Lambda: 0.5 (balanced relevance/diversity)               │
│                                                             │
│    Latency: ~3ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ RESULT                                                      │
│ 8 chunks, each covering a distinct topic:                   │
│ • Password reset                                            │
│ • Account deletion                                          │
│ • 2FA setup                                                 │
│ • Support contact                                           │
│ • Billing                                                   │
│ • API keys                                                  │
│ • Data export                                               │
│ • Team management                                           │
│                                                             │
│ Total added latency: ~12ms                                  │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
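&lt;p&gt;The knobs from the five stages can be collected in a single config. The struct and field names below are illustrative, not the library's exact API; the defaults mirror the numbers in the walkthrough above:&lt;/p&gt;

```go
package main

import "fmt"

// PipelineConfig gathers the tunables used by each stage.
type PipelineConfig struct {
	TargetK    int     // chunks the LLM finally receives
	OverFetchK int     // Stage 2: retrieve 3-5x TargetK from the vector DB
	Threshold  float64 // Stage 3: stop merging above this cosine distance
	Linkage    string  // Stage 3: "single", "complete", or "average"
	Strategy   string  // Stage 4: "score", "centroid", or "hybrid"
	Lambda     float64 // Stage 5: relevance/diversity trade-off for MMR
}

func defaultConfig() PipelineConfig {
	return PipelineConfig{
		TargetK:    8,
		OverFetchK: 50,
		Threshold:  0.15,
		Linkage:    "average",
		Strategy:   "score",
		Lambda:     0.5,
	}
}

func main() {
	cfg := defaultConfig()
	fmt.Printf("fetch %d, keep %d\n", cfg.OverFetchK, cfg.TargetK) // prints: fetch 50, keep 8
}
```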






&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Distance Matrix Computation
&lt;/h3&gt;

&lt;p&gt;For N chunks, we compute an N×N distance matrix. This is O(N²) but N is small (50-100 chunks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Clusterer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;computeDistanceMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([][]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CosineDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;
            &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;  &lt;span class="c"&gt;// Symmetric&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cosine Distance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;CosineDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normB&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;normA&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;normB&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;normA&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;normB&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;  &lt;span class="c"&gt;// Max distance&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normB&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;  &lt;span class="c"&gt;// Convert to distance&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Centroid Update on Merge
&lt;/h3&gt;

&lt;p&gt;When merging clusters, we recompute the centroid as the mean of all member embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Clusterer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mergeClusters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clusterNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Recompute centroid&lt;/span&gt;
    &lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;newCentroid&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;newCentroid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;invN&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;newCentroid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;invN&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;centroid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newCentroid&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Configuration Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cluster Threshold
&lt;/h3&gt;

&lt;p&gt;The threshold controls how aggressively we merge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;Conservative, more clusters&lt;/td&gt;
&lt;td&gt;High-precision domains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;Balanced (default)&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;Aggressive, fewer clusters&lt;/td&gt;
&lt;td&gt;Noisy/redundant data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.25+&lt;/td&gt;
&lt;td&gt;Very aggressive&lt;/td&gt;
&lt;td&gt;Heavy deduplication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
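&lt;p&gt;To make the table concrete, here is a minimal sketch of how a threshold gates a merge decision, re-deriving cosine distance as in the snippet above (function names here are illustrative, not the library's actual API):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cosine similarity: 0 means identical
// direction, values near 1 mean unrelated vectors.
func cosineDistance(a, b []float32) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 1.0 // max distance for zero vectors
	}
	return 1.0 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

// shouldMerge applies the threshold from the table: clusters whose
// centroids are closer than the threshold get merged.
func shouldMerge(centroidA, centroidB []float32, threshold float64) bool {
	return cosineDistance(centroidA, centroidB) < threshold
}

func main() {
	a := []float32{1, 0, 0}
	b := []float32{0.99, 0.1, 0} // nearly parallel: tiny distance
	fmt.Println(shouldMerge(a, b, 0.15))  // merges at the default threshold
	fmt.Println(shouldMerge(a, b, 0.001)) // too strict: stays separate
}
```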

&lt;h3&gt;
  
  
  MMR Lambda
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lambda&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;Diversity-focused&lt;/td&gt;
&lt;td&gt;Exploratory queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Balanced (default)&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;Relevance-focused&lt;/td&gt;
&lt;td&gt;Precise queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Pure relevance&lt;/td&gt;
&lt;td&gt;No diversity needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
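&lt;p&gt;The lambda trade-off follows directly from the standard MMR formula, score = lambda * relevance - (1 - lambda) * maxSimilarityToSelected. A sketch of the general algorithm, not the library's internals:&lt;/p&gt;

```go
package main

import "fmt"

// mmrScore implements the classic Maximal Marginal Relevance formula:
// lambda weights query relevance against redundancy with results
// that have already been selected.
func mmrScore(relevance, maxSimToSelected, lambda float64) float64 {
	return lambda*relevance - (1-lambda)*maxSimToSelected
}

func main() {
	// A highly relevant but redundant candidate...
	redundant := mmrScore(0.9, 0.8, 0.5)
	// ...loses to a slightly less relevant but novel one.
	novel := mmrScore(0.7, 0.1, 0.5)
	fmt.Println(novel > redundant) // true at the balanced default

	// At lambda = 1.0 only relevance matters, so the order flips.
	fmt.Println(mmrScore(0.9, 0.8, 1.0) > mmrScore(0.7, 0.1, 1.0)) // true
}
```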

&lt;h3&gt;
  
  
  Over-fetch Ratio
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;th&gt;Chunks Retrieved&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2x&lt;/td&gt;
&lt;td&gt;16 for K=8&lt;/td&gt;
&lt;td&gt;Minimal overhead, less dedup potential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3x&lt;/td&gt;
&lt;td&gt;24 for K=8&lt;/td&gt;
&lt;td&gt;Good balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5x&lt;/td&gt;
&lt;td&gt;40 for K=8&lt;/td&gt;
&lt;td&gt;Maximum dedup, higher retrieval cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
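&lt;p&gt;The over-fetch stage itself is simple arithmetic: retrieve ratio times K candidates, deduplicate, then truncate back to K. A sketch with illustrative names:&lt;/p&gt;

```go
package main

import "fmt"

// fetchCount returns how many chunks to pull from the vector DB so
// that deduplication still leaves K distinct results.
func fetchCount(k, ratio int) int {
	return k * ratio
}

// truncate keeps the top K survivors after deduplication
// (candidates are assumed already sorted by score).
func truncate(ids []int, k int) []int {
	if len(ids) <= k {
		return ids
	}
	return ids[:k]
}

func main() {
	fmt.Println(fetchCount(8, 3)) // 24, matching the table's 3x row
	survivors := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(len(truncate(survivors, 8))) // 8
}
```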




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Distance matrix&lt;/td&gt;
&lt;td&gt;2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clustering&lt;/td&gt;
&lt;td&gt;6ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Selection&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMR&lt;/td&gt;
&lt;td&gt;3ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~12ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is negligible compared to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector DB query: 15-50ms&lt;/li&gt;
&lt;li&gt;LLM inference: 500-2000ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the savings are significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35% fewer tokens&lt;/strong&gt; per query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2x more diverse&lt;/strong&gt; context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better LLM answers&lt;/strong&gt; (less confusion from repetition)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  As a Proxy
&lt;/h3&gt;

&lt;p&gt;GoVectorSync sits between your app and vector DB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┐     ┌──────────────┐     ┌────────────┐
│  Your   │────▶│ GoVectorSync │────▶│  Pinecone  │
│   App   │◀────│    Proxy     │◀────│   Qdrant   │
└─────────┘     └──────────────┘     └────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
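&lt;p&gt;Because the proxy is intended to be drop-in, integration is typically a base-URL change. A hypothetical sketch; the endpoint path and payload shape are illustrative, not the actual API:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// queryRequest is an illustrative payload; the real proxy API may differ.
type queryRequest struct {
	Vector []float32 `json:"vector"`
	TopK   int       `json:"top_k"`
}

// buildQuery constructs the request URL and body. Pointing baseURL at
// the proxy instead of the vector DB is the only change an app makes.
func buildQuery(baseURL string, vec []float32, k int) (string, *bytes.Reader, error) {
	body, err := json.Marshal(queryRequest{Vector: vec, TopK: k})
	if err != nil {
		return "", nil, err
	}
	return baseURL + "/query", bytes.NewReader(body), nil
}

func main() {
	url, _, err := buildQuery("http://localhost:8080", []float32{0.1, 0.2}, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(url) // http://localhost:8080/query
}
```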



&lt;h3&gt;
  
  
  Direct Integration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://waitlist.siddhantkhare.com/?project=GoVectorSync" rel="noopener noreferrer"&gt;Request access to the Go SDK &amp;amp; demo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;More vector DB integrations&lt;/strong&gt; (Weaviate, Milvus, Chroma)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion-time dedup&lt;/strong&gt; (deduplicate before storing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive thresholds&lt;/strong&gt; (learn optimal settings per namespace)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming support&lt;/strong&gt; (deduplicate as chunks arrive)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Join the waitlist for early access:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://https://distill.siddhantkhare.com/" rel="noopener noreferrer"&gt;https://distill.siddhantkhare.com/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on &lt;a href="https://www.linkedin.com/in/siddhantkhare24" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/siddhant_K_code" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://siddhantkhare.com" rel="noopener noreferrer"&gt;Siddhant Khare&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>programming</category>
    </item>
    <item>
      <title>Beyond finding: Remediating CVE-2025-55182 across hundreds of repositories with Ona Automations</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Tue, 09 Dec 2025 08:15:19 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/beyond-finding-remediating-cve-2025-55182-across-hundreds-of-repositories-with-ona-automations-1p3n</link>
      <guid>https://dev.to/siddhantkcode/beyond-finding-remediating-cve-2025-55182-across-hundreds-of-repositories-with-ona-automations-1p3n</guid>
      <description>&lt;p&gt;Finding vulnerable code is only half the battle. When a critical CVE drops, engineering teams face a familiar nightmare: discovering affected repositories, coordinating fixes across teams, and ensuring nothing slips through the cracks. &lt;strong&gt;What if you could fix them all, automatically?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The CVE Remediation problem at scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-55182" rel="noopener noreferrer"&gt;CVE-2025-55182&lt;/a&gt;, a critical CVSS 10.0 vulnerability in React Server Components, was disclosed on November 29th, 2025, organizations scrambled to assess their exposure. The vulnerability affects any application using React Server Components with packages like &lt;code&gt;react-server-dom-webpack&lt;/code&gt;, &lt;code&gt;react-server-dom-parcel&lt;/code&gt;, or &lt;code&gt;react-server-dom-turbopack&lt;/code&gt; in versions 19.0 through 19.2.0.&lt;/p&gt;

&lt;p&gt;Code search tools help you &lt;strong&gt;find&lt;/strong&gt; affected repositories. But what happens next?&lt;/p&gt;

&lt;p&gt;For most teams, the answer involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creating tickets for each affected repository&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coordinating across multiple teams and timezones&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manually applying the same fix hundreds of times&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hoping no repositories get missed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spending days or weeks on what should be hours of work&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://ona.com/docs/ona/automations/overview" rel="noopener noreferrer"&gt;&lt;strong&gt;Ona Automations&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;changes this equation entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;From discovery to remediation in minutes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ona.com/docs/ona/automations/overview" rel="noopener noreferrer"&gt;Ona Automations&lt;/a&gt; are end-to-end workflows that execute changes across your entire codebase, in parallel. Instead of finding vulnerabilities and then spending weeks coordinating fixes, you can discover, remediate, test, and create pull requests across hundreds of repositories simultaneously.&lt;/p&gt;

&lt;p&gt;Here's how it works for CVE-2025-55182:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Create the Automation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Navigate to &lt;strong&gt;Automations&lt;/strong&gt; in Ona and click &lt;strong&gt;New Automation&lt;/strong&gt;. Give it a name like "CVE-2025-55182 Remediation" and select a &lt;a href="https://ona.com/docs/ona/automations/service-accounts" rel="noopener noreferrer"&gt;service account&lt;/a&gt; to run it; this ensures all commits and PRs are clearly attributed to automation rather than individual engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Define your target scope&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use GitHub repository search to target all potentially affected repositories: &lt;code&gt;org:your-org package.json react-server-dom&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or target specific projects that you know use React Server Components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Select &lt;strong&gt;Projects&lt;/strong&gt; as your target type&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose your frontend applications, Next.js services, or any projects using RSC&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Configure the Remediation Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Ona Automations support three step types: &lt;strong&gt;prompts&lt;/strong&gt; (natural language instructions for Ona Agent), &lt;strong&gt;shell scripts&lt;/strong&gt; (deterministic commands), and &lt;strong&gt;pull request steps&lt;/strong&gt; (automated PR creation).&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2025-55182" rel="noopener noreferrer"&gt;CVE-2025-55182&lt;/a&gt;, a multi-step workflow might look like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Prompt: Analyze and upgrade dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyze this repository for vulnerable React Server Components packages 
(react-server-dom-webpack, react-server-dom-parcel, react-server-dom-turbopack) 
in versions 19.0, 19.1.0, 19.1.1, or 19.2.0. 

If found, upgrade to the latest patched version. Also check for and upgrade 
any dependent frameworks:
- Next.js to 15.5.7+ or 16.0.7+
- React Router if using RSC features

Update package.json and run the package manager to update lock files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Shell Script: verify the fix&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 - Pull Request: submit for review&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Title: [Security] Remediate CVE-2025-55182 - React Server Components RCE
Description: Automated security update to patch critical RCE vulnerability 
in React Server Components. See: https://nvd.nist.gov/vuln/detail/CVE-2025-55182
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Set Guardrails&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before running at scale, configure guardrails to control execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Max concurrent executions&lt;/strong&gt;: Start with 10 to validate the automation works correctly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Max total executions&lt;/strong&gt;: Set to match your repository count (e.g., 100 for initial rollout)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For critical vulnerabilities, you might scale up to 50 concurrent executions across 500+ repositories after initial validation.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Execute and Monitor&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Click &lt;strong&gt;Run&lt;/strong&gt;. Ona spins up isolated environments for each repository, running your automation steps in parallel. The Action Run Details page shows real-time progress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Running&lt;/strong&gt;: Currently executing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pending&lt;/strong&gt;: Queued and waiting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Completed&lt;/strong&gt;: Successfully finished&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failed&lt;/strong&gt;: Encountered errors (click to see logs)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each action maintains full conversation logs showing exactly what Ona Agent did, what commands ran, and any errors encountered.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Why this matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A customer recently shared their experience with CVE remediation using Ona Automations:&lt;/p&gt;

&lt;p&gt;"90–95% of work is done by Ona Automations. We just have to do the final push commands."&lt;/p&gt;

&lt;p&gt;The math speaks for itself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;100 Repositories&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;500 Repositories&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Manual remediation&lt;/td&gt;
&lt;td&gt;2-3 weeks&lt;/td&gt;
&lt;td&gt;6-8 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ona Automations&lt;/td&gt;
&lt;td&gt;2-3 hours&lt;/td&gt;
&lt;td&gt;4-6 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's not just time saved; it's &lt;strong&gt;reduced vulnerability exposure time&lt;/strong&gt;. Every hour a CVE remains unpatched is an hour of risk.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Scheduled Scanning: prevention over reaction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Beyond one-time remediation, Ona Automations support &lt;a href="https://ona.com/docs/ona/automations/triggers/timebased" rel="noopener noreferrer"&gt;time-based triggers&lt;/a&gt; for ongoing security hygiene:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule: Weekly, Monday at 2:00 AM
Target: All repositories
Steps:
  1. Scan for known CVEs in dependencies
  2. Upgrade vulnerable packages
  3. Run tests to verify compatibility
  4. Create PRs for any changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This transforms CVE response from reactive firefighting to proactive maintenance. Your repositories stay patched automatically, with pull requests ready for review each Monday morning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Security built in&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ona Automations include enterprise-grade guardrails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment isolation&lt;/strong&gt;: Each automation runs in a dedicated, isolated environment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Command deny lists&lt;/strong&gt;: Prevent execution of dangerous commands like &lt;code&gt;sudo&lt;/code&gt; or &lt;code&gt;rm -rf /&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit trails&lt;/strong&gt;: Complete logging of every command, file modification, and PR creation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service account separation&lt;/strong&gt;: Clear distinction between automation activity and human work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrency limits&lt;/strong&gt;: Prevent runaway executions and control resource usage&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Getting started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ready to transform how you handle CVE remediation?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a service account&lt;/strong&gt; in Settings → Members → Service Accounts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configure Git authentication&lt;/strong&gt; with appropriate repository access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create your first automation&lt;/strong&gt; targeting a small set of repositories&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validate the results&lt;/strong&gt; by reviewing generated PRs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale up&lt;/strong&gt; by increasing guardrail limits&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For CVE-2025-55182 specifically, start by targeting repositories matching &lt;code&gt;package.json react-server-dom&lt;/code&gt; in your organization. Run on 5-10 repositories first to validate the automation behaves correctly, then scale to your full repository base.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Beyond CVEs: What else can you automate?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The same patterns that work for CVE remediation apply to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependency updates&lt;/strong&gt;: Weekly automated upgrades with compatibility testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code migrations&lt;/strong&gt;: API changes, framework upgrades, or deprecation handling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation updates&lt;/strong&gt;: Keep READMEs, Backstage YAML, and API docs current&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance enforcement&lt;/strong&gt;: License checks, security policy updates, configuration standardization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pull request reviews&lt;/strong&gt;: Automated security analysis on every code change using &lt;a href="https://ona.com/docs/ona/automations/triggers/pullrequest" rel="noopener noreferrer"&gt;PR triggers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Ona Automations is available for Enterprise customers.&lt;/em&gt; &lt;a href="https://ona.com/contact/demo" rel="noopener noreferrer"&gt;Request a demo&lt;/a&gt; to see how automations can transform your organization's approach to large-scale code changes.&lt;/p&gt;

&lt;p&gt;Have questions about setting up CVE remediation automations? Reach out to your account manager or explore our &lt;a href="https://ona.com/docs/ona/automations/overview" rel="noopener noreferrer"&gt;&lt;em&gt;Automations documentation&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on &lt;a href="https://www.linkedin.com/in/siddhantkhare24" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/siddhant_K_code" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>security</category>
    </item>
    <item>
      <title>Securing Agentic AI: authorization patterns for autonomous systems</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Sat, 29 Nov 2025 11:28:52 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/securing-agentic-ai-authorization-patterns-for-autonomous-systems-3ajo</link>
      <guid>https://dev.to/siddhantkcode/securing-agentic-ai-authorization-patterns-for-autonomous-systems-3ajo</guid>
      <description>&lt;p&gt;&lt;em&gt;Why traditional access control fails for AI agents, and how relationship-based authorization provides a path forward.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The first time I watched an AI agent autonomously chain together twelve API calls to complete a task, I felt two things simultaneously: excitement at the capability, and dread at the security implications.&lt;/p&gt;

&lt;p&gt;The agent had been asked to "prepare the weekly team update." It read from Linear, queried our metrics dashboard, pulled context from Slack, drafted content, and posted to Notion. Twelve tools. Thirty seconds. Zero authorization checks beyond the initial OAuth tokens we'd configured at setup.&lt;/p&gt;

&lt;p&gt;Every one of those tokens had broad permissions. The agent could have read &lt;em&gt;any&lt;/em&gt; Linear ticket, &lt;em&gt;any&lt;/em&gt; Slack channel, modified &lt;em&gt;any&lt;/em&gt; Notion page. We'd built an incredibly capable system with the security model of a shared password on a sticky note.&lt;/p&gt;

&lt;p&gt;This is the state of authorization in agentic AI today. And it's a problem we need to solve before these systems become critical infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Authorization model mismatch
&lt;/h2&gt;

&lt;p&gt;Traditional authorization was designed for a straightforward interaction pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human → Action → Resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user clicks "delete," the system checks if they have the &lt;code&gt;delete&lt;/code&gt; permission on that resource, and the action proceeds or fails. The human is present, the action is discrete, and the permission check is synchronous.&lt;/p&gt;
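&lt;p&gt;That synchronous model is almost trivial to implement, which is part of why it became ubiquitous. A minimal sketch (the role map is illustrative):&lt;/p&gt;

```go
package main

import "fmt"

// rbacCheck is the classic synchronous model: a static role-to-permission
// map, consulted at the moment the human acts.
func rbacCheck(rolePerms map[string][]string, role, perm string) bool {
	for _, p := range rolePerms[role] {
		if p == perm {
			return true
		}
	}
	return false
}

func main() {
	perms := map[string][]string{"editor": {"read", "delete"}}
	fmt.Println(rbacCheck(perms, "editor", "delete")) // true
	fmt.Println(rbacCheck(perms, "viewer", "delete")) // false
}
```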

&lt;p&gt;Agentic AI breaks every assumption in this model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Human → Agent → [Plan] → Tool₁ → Tool₂ → ... → Toolₙ → Resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The human initiates a goal, not an action. The agent autonomously decides &lt;em&gt;which&lt;/em&gt; actions to take, in &lt;em&gt;what&lt;/em&gt; sequence, on &lt;em&gt;which&lt;/em&gt; resources. The human may not even know what tools will be invoked until after execution completes.&lt;/p&gt;

&lt;p&gt;This creates three fundamental problems that traditional Role-Based Access Control (RBAC) cannot address.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Delegation without boundaries
&lt;/h3&gt;

&lt;p&gt;When you grant an agent access to your email, what are you actually authorizing?&lt;/p&gt;

&lt;p&gt;In most current implementations, you're handing over an OAuth token with scopes like &lt;code&gt;gmail.readonly&lt;/code&gt; or &lt;code&gt;gmail.compose&lt;/code&gt;. The agent now has the same access you do—to all your emails, regardless of what task you asked it to perform.&lt;/p&gt;

&lt;p&gt;Ask the agent to "summarize emails from last week" and it &lt;em&gt;could&lt;/em&gt; technically read emails from three years ago, from your confidential HR folder. Nothing in the authorization model prevents this. We're relying on the model's alignment to stay within bounds—a strategy that works until it doesn't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: Action sequences create emergent risk
&lt;/h3&gt;

&lt;p&gt;Consider two individually innocuous permissions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read calendar events&lt;/li&gt;
&lt;li&gt;Send Slack messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An agent with both permissions could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read your calendar&lt;/li&gt;
&lt;li&gt;Find a meeting titled "Confidential: Q4 Acquisition Discussion"&lt;/li&gt;
&lt;li&gt;Post that meeting's details to a public Slack channel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each permission check passes. The &lt;em&gt;combination&lt;/em&gt; creates a data exfiltration pathway that neither permission individually reveals. RBAC has no mechanism to reason about action sequences or their emergent risks.&lt;/p&gt;
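&lt;p&gt;One way to reason about this is to evaluate the planned tool sequence against cross-tool policies instead of each call in isolation. A hypothetical sketch; the policy shape is illustrative, not a real library API:&lt;/p&gt;

```go
package main

import "fmt"

// denyRule flags a risky combination: data read from a sensitive
// source may not flow to a public sink, even though each call is
// individually permitted.
type denyRule struct {
	source, sink string
}

// sequenceAllowed walks the planned tool calls in order and rejects
// any plan where a denied sink appears after its source.
func sequenceAllowed(plan []string, rules []denyRule) bool {
	seen := map[string]bool{}
	for _, step := range plan {
		for _, r := range rules {
			if step == r.sink && seen[r.source] {
				return false
			}
		}
		seen[step] = true
	}
	return true
}

func main() {
	rules := []denyRule{{source: "calendar.read", sink: "slack.post_public"}}
	// The exfiltration pathway from the example above is rejected...
	fmt.Println(sequenceAllowed([]string{"calendar.read", "slack.post_public"}, rules)) // false
	// ...while a private message after a calendar read is fine.
	fmt.Println(sequenceAllowed([]string{"calendar.read", "slack.post_dm"}, rules)) // true
}
```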

&lt;h3&gt;
  
  
  Problem 3: Temporal and contextual blindness
&lt;/h3&gt;

&lt;p&gt;Permissions in RBAC are static. You have a role, that role has permissions, those permissions exist until revoked.&lt;/p&gt;

&lt;p&gt;But agent authorization should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task-scoped&lt;/strong&gt;: Valid only for the current task, not indefinitely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-bound&lt;/strong&gt;: Expire after the task completes or a timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware&lt;/strong&gt;: Different permissions based on what the user asked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instantly revocable&lt;/strong&gt;: User says "stop" and all access terminates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A role like &lt;code&gt;agent-assistant&lt;/code&gt; with permissions &lt;code&gt;[read:documents, write:documents, read:calendar]&lt;/code&gt; captures none of this nuance. The agent has those permissions whether it's summarizing a document or doing something the user never requested.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Relationship-Based access control fits
&lt;/h2&gt;

&lt;p&gt;Relationship-Based Access Control (ReBAC) models authorization as a graph of relationships between entities. Instead of asking "does this role have this permission?", ReBAC asks "does a path exist between this actor and this resource through authorized relationships?"&lt;/p&gt;

&lt;p&gt;This graph-based model maps naturally to agentic authorization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user:alice
  └── delegated_to → agent:session-123
                        └── for_task → task:weekly-update
                                          ├── can_read → linear:project-eng
                                          ├── can_read → slack:channel-team
                                          └── can_write → notion:page-updates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The relationships encode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Who&lt;/strong&gt; delegated authority (alice)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To whom&lt;/strong&gt; (agent session 123)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For what purpose&lt;/strong&gt; (the weekly-update task)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With what scope&lt;/strong&gt; (specific resources, specific operations)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Revocation is simple: delete the &lt;code&gt;delegated_to&lt;/code&gt; relationship, and all downstream access disappears. The graph structure makes this transitive by design.&lt;/p&gt;
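
&lt;p&gt;A toy reachability check makes the cascade concrete (a self-contained sketch with hypothetical node names, not OpenFGA API calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy delegation graph: each node maps to the nodes it points at
edges = {
    "user:alice": {"agent:session-123"},
    "agent:session-123": {"task:weekly-update"},
    "task:weekly-update": {"linear:project-eng", "slack:channel-team"},
}

def reachable(graph, start, target):
    """Depth-first search: does an authorization path exist?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

assert reachable(edges, "user:alice", "linear:project-eng")

# Revoking means deleting the single delegation edge...
edges["user:alice"].discard("agent:session-123")

# ...after which every downstream resource is unreachable.
assert not reachable(edges, "user:alice", "linear:project-eng")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;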




&lt;h2&gt;
  
  
  Building the authorization model
&lt;/h2&gt;

&lt;p&gt;Let's build a concrete authorization model for agentic systems using &lt;a href="https://openfga.dev" rel="noopener noreferrer"&gt;OpenFGA&lt;/a&gt;, an open-source ReBAC implementation. I'll walk through the model design, implementation patterns, and integration architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The type system
&lt;/h3&gt;

&lt;p&gt;First, we define the entities and relationships in our authorization model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# OpenFGA Authorization Model (DSL format)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;
  &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;
  &lt;span class="n"&gt;relations&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;active_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
  &lt;span class="n"&gt;relations&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;delegator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;assignee&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;can_access&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;
  &lt;span class="n"&gt;relations&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
  &lt;span class="n"&gt;relations&lt;/span&gt;
    &lt;span class="n"&gt;define&lt;/span&gt; &lt;span class="n"&gt;can_invoke&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model establishes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Users&lt;/strong&gt; own agents and resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt; have owners and track active sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; are the unit of delegation—they connect users, agents, and accessible resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; can grant read/write access to users, agents, or tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; can be invoked by users, agents, or scoped to specific tasks&lt;/li&gt;
&lt;/ul&gt;
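
&lt;p&gt;Instantiated for the weekly-update scenario from earlier, the relationship tuples would look something like this (identifiers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user:alice          delegator  task:weekly-update
agent:session-123   assignee   task:weekly-update
task:weekly-update  reader     resource:linear-project-eng
task:weekly-update  reader     resource:slack-channel-team
task:weekly-update  writer     resource:notion-page-updates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;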

&lt;h3&gt;
  
  
  The delegation pattern
&lt;/h3&gt;

&lt;p&gt;When a user initiates an agent task, we create a task object that acts as a permission boundary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openfga_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenFgaClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ClientConfiguration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openfga_sdk.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientTuple&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentAuthorizationService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openfga_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OpenFgaClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openfga_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_task_delegation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;allowed_resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ttl_minutes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Create a task-scoped delegation from user to agent.

        This establishes a permission boundary: the agent can only
        access resources explicitly linked to this task.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ttl_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="c1"&gt;# User delegates to this task
&lt;/span&gt;            &lt;span class="nc"&gt;ClientTuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delegator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;# Agent is assigned to this task
&lt;/span&gt;            &lt;span class="nc"&gt;ClientTuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Scope specific resources to this task
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed_resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;access_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reader&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;tuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nc"&gt;ClientTuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;writes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tuples&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Store task metadata (for expiration handling)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_store_task_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical insight here: the agent doesn't receive broad permissions. It receives assignment to a &lt;em&gt;task&lt;/em&gt;, and that task has specific resource access. The agent's authority is bounded by the task's scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking authorization
&lt;/h3&gt;

&lt;p&gt;Before any agent action, we verify authorization through the task relationship:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_agent_resource_access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reader&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Check if an agent can access a resource within a task context.

    Returns (authorized: bool, reason: str)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# First: Is this agent assigned to this task?
&lt;/span&gt;    &lt;span class="n"&gt;agent_assigned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;agent_assigned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent not assigned to this task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Second: Does this task have access to this resource?
&lt;/span&gt;    &lt;span class="n"&gt;task_has_access&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task_has_access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task does not have &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; access to resource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This two-step check ensures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent is legitimately working on this task&lt;/li&gt;
&lt;li&gt;The task has been granted access to this resource&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An agent can't access resources outside its assigned task, even if it has accessed those resources in previous tasks.&lt;/p&gt;
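
&lt;p&gt;The same two-step logic can be exercised against an in-memory tuple set (a self-contained sketch mirroring the check above, not the OpenFGA client):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Relationship tuples as (user, relation, object) triples
tuples = {
    ("agent:a1", "assignee", "task:t1"),
    ("task:t1", "reader", "resource:doc-1"),
}

def check_access(agent, task, resource, access="reader"):
    """Two-step check: task assignment first, then task-scoped resource access."""
    if (agent, "assignee", task) not in tuples:
        return False, "Agent not assigned to this task"
    if (task, access, resource) not in tuples:
        return False, f"Task does not have {access} access to resource"
    return True, "Authorized"

assert check_access("agent:a1", "task:t1", "resource:doc-1") == (True, "Authorized")
# A different task never granted this resource: denied, even for the same agent
assert check_access("agent:a1", "task:t2", "resource:doc-1")[0] is False
# Write access was never scoped to the task: denied
assert check_access("agent:a1", "task:t1", "resource:doc-1", "writer")[0] is False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;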

&lt;h3&gt;
  
  
  Revocation that actually works
&lt;/h3&gt;

&lt;p&gt;When a user cancels a task or the task expires, all derived permissions must terminate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;revoke_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Revoke a task and all its associated permissions.

    This is where ReBAC shines: deleting the task relationships
    cascades to remove all resource access.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read all tuples where this task is involved
&lt;/span&gt;    &lt;span class="n"&gt;related_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Also get tuples where task is the user (accessing resources)
&lt;/span&gt;    &lt;span class="n"&gt;task_access_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;all_tuples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;related_tuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tuples&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
        &lt;span class="n"&gt;task_access_tuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tuples&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;all_tuples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deletes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuple_keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_tuples&lt;/span&gt;
                    &lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_delete_task_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tuples_revoked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_tuples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revoked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this to RBAC revocation, where you'd need to track every permission granted, remember which ones were for this task versus others, and selectively revoke. The relationship graph makes the task a natural permission boundary that can be deleted atomically.&lt;/p&gt;
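&lt;p&gt;For contrast, here is a minimal sketch of the bookkeeping a flat RBAC model forces on you (the &lt;code&gt;RbacStore&lt;/code&gt; and its grant-tracking table are hypothetical, purely illustrative): every grant has to be tagged with the task it was issued for, just so revocation can find it again later.&lt;/p&gt;

```python
from collections import defaultdict

class RbacStore:
    """Hypothetical flat-RBAC store: no relationship graph, so revocation
    needs a side table mapping each task to the grants made on its behalf."""

    def __init__(self):
        self.role_bindings = set()               # (principal, role, resource)
        self.grants_by_task = defaultdict(list)  # manual per-task bookkeeping

    def grant(self, task_id, principal, role, resource):
        binding = (principal, role, resource)
        self.role_bindings.add(binding)
        # Must remember *why* each binding exists, or revocation is guesswork
        self.grants_by_task[task_id].append(binding)

    def revoke_task(self, task_id):
        revoked = 0
        for binding in self.grants_by_task.pop(task_id, []):
            self.role_bindings.discard(binding)
            revoked += 1
        return revoked

store = RbacStore()
store.grant("task:42", "agent:a1", "reader", "doc:q3-report")
store.grant("task:42", "agent:a1", "writer", "slack:general")
store.grant("task:99", "agent:a1", "reader", "doc:other")

print(store.revoke_task("task:42"))  # 2
print(("agent:a1", "reader", "doc:other") in store.role_bindings)  # True
```

&lt;p&gt;In the relationship-graph version, that side table never exists: the tuples already point at the task object, so listing and deleting them is a single read plus a single write.&lt;/p&gt;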




&lt;h2&gt;
  
  
  The authorization gateway
&lt;/h2&gt;

&lt;p&gt;Individual authorization checks aren't enough. We need a gateway that sits between the agent and every external tool and resource, enforcing authorization on every action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wraps&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AuthorizationGateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Gateway that wraps all agent tool calls with authorization checks.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentAuthorizationService&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;authorized_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;resource_extractor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reader&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Decorator that wraps a tool function with authorization.

        Args:
            resource_extractor: Function to extract resource ID from tool args
            access_type: Required access level (reader, writer)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_func&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resource_extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Check authorization
&lt;/span&gt;                &lt;span class="n"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_agent_resource_access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_type&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Audit logging (always, regardless of outcome)
&lt;/span&gt;                &lt;span class="n"&gt;audit_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audit_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;authorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AuthorizationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unauthorized: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;audit_entry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audit_entry&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Execute the actual tool
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;tool_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AuthorizationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audit_entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audit_entry&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can wrap our tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gateway&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuthorizationGateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@gateway.authorized_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resource_extractor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reader&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read a document from the document store.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@gateway.authorized_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resource_extractor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Update a document in the document store.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@gateway.authorized_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;resource_extractor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channel_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_to_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Post a message to a Slack channel.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;slack_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every tool invocation now passes through authorization. The agent can't bypass it: the tools simply don't execute without a valid authorization context.&lt;/p&gt;
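&lt;p&gt;To see the deny path end to end, here is a self-contained sketch that condenses the gateway pattern above and swaps in a hypothetical &lt;code&gt;StubAuthService&lt;/code&gt; (an in-memory grant set standing in for the OpenFGA-backed service): an ungranted resource raises before the tool body ever runs, and both outcomes land in the audit log.&lt;/p&gt;

```python
import asyncio
from functools import wraps

class AuthorizationError(Exception):
    pass

class StubAuthService:
    """Hypothetical stand-in: permits only pre-granted (task, resource, access) triples."""
    def __init__(self, grants):
        self.grants = grants

    async def check_agent_resource_access(self, agent_id, task_id, resource_id, access_type):
        if (task_id, resource_id, access_type) in self.grants:
            return True, "granted for task"
        return False, f"no {access_type} grant on {resource_id} for {task_id}"

class AuthorizationGateway:
    def __init__(self, auth_service):
        self.auth_service = auth_service
        self.audit_log = []

    def authorized_tool(self, resource_extractor, access_type="reader"):
        def decorator(tool_func):
            @wraps(tool_func)
            async def wrapper(agent_id, task_id, **kwargs):
                resource_id = resource_extractor(kwargs)
                ok, reason = await self.auth_service.check_agent_resource_access(
                    agent_id, task_id, resource_id, access_type
                )
                # Audit before deciding, so denials are recorded too
                self.audit_log.append({"tool": tool_func.__name__, "authorized": ok})
                if not ok:
                    raise AuthorizationError(reason)
                return await tool_func(**kwargs)
            return wrapper
        return decorator

gateway = AuthorizationGateway(StubAuthService({("task:42", "doc:readme", "reader")}))

@gateway.authorized_tool(resource_extractor=lambda a: a["document_id"])
async def read_document(document_id):
    return f"contents of {document_id}"

async def demo():
    allowed = await read_document("agent:a1", "task:42", document_id="doc:readme")
    try:
        await read_document("agent:a1", "task:42", document_id="doc:secret")
        denied = None
    except AuthorizationError as exc:
        denied = str(exc)
    return allowed, denied

allowed, denied = asyncio.run(demo())
print(allowed)  # contents of doc:readme
print(denied)   # no reader grant on doc:secret for task:42
```

&lt;p&gt;The stub exists only so the sketch runs standalone; the control flow (extract resource, check, audit, raise or execute) is unchanged from the real gateway.&lt;/p&gt;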




&lt;h2&gt;
  
  
  Handling scope inference
&lt;/h2&gt;

&lt;p&gt;One challenge remains: how do we determine &lt;em&gt;which&lt;/em&gt; resources a task should have access to? The user says "summarize my emails from last week," and we need to translate that into specific permission grants.&lt;/p&gt;

&lt;p&gt;This requires a &lt;strong&gt;scope inference&lt;/strong&gt; layer that runs before task creation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ScopeInferenceService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Infer required resource scopes from natural language task descriptions.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;SCOPE_INFERENCE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s request and determine the minimal required resource access.

User request: {request}

Available resource types:
- email: Gmail messages (scopes: read, send, delete)
- calendar: Google Calendar (scopes: read, write)
- documents: Google Docs (scopes: read, write)
- slack: Slack channels (scopes: read, write)
- linear: Linear issues (scopes: read, write)

Output a JSON object with:
{{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    {{
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific_id or pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read or write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;constraints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {{
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;if applicable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;any filters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]
      }}
    }}
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief explanation of why these resources are needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}}

Be minimal: only include resources strictly necessary for the task.
Prefer read access over write access unless modification is explicitly requested.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infer_scopes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;available_resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Infer minimal required scopes from a natural language request.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SCOPE_INFERENCE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Parse the response
&lt;/span&gt;        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
        &lt;span class="n"&gt;scope_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract JSON from response
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;scope_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate against available resources
&lt;/span&gt;        &lt;span class="n"&gt;validated_resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scope_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_resource_available&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;available_resources&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;validated_resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;validated_resources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scope_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_request&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resource_available&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;requested&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a requested resource is in the available set.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;avail&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avail&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;requested&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; 
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_id_matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requested&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;avail&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_id_matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requested_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;available_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a requested resource ID matches an available one.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;requested_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;available_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;available_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;available_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requested_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete flow becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initiate_agent_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Complete flow: user request → scope inference → task creation → agent execution
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Get user's available resources
&lt;/span&gt;    &lt;span class="n"&gt;user_resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_user_resources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Infer minimal required scopes
&lt;/span&gt;    &lt;span class="n"&gt;scope_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ScopeInferenceService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;anthropic_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inferred_scopes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scope_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infer_scopes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;available_resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_resources&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Create task with scoped permissions
&lt;/span&gt;    &lt;span class="n"&gt;auth_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentAuthorizationService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openfga_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task_delegation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;allowed_resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inferred_scopes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ttl_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Return task context for agent execution
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scopes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inferred_scopes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Production considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TTL and automatic expiration
&lt;/h3&gt;

&lt;p&gt;Tasks should expire automatically. Implement a background job that cleans up expired tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleanup_expired_tasks&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Background job to revoke expired tasks.
    Run every minute via scheduler.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;expired_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_expired_task_ids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expired_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;revoke_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revoked expired task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to revoke task &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
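&lt;p&gt;How you run this "every minute" depends on your stack. As a minimal sketch, assuming a plain asyncio application with no external scheduler, you can pass the cleanup coroutine to a loop like this (&lt;code&gt;run_cleanup_scheduler&lt;/code&gt; is a name I'm introducing, not part of any library):&lt;/p&gt;

```python
import asyncio
import logging

logger = logging.getLogger("agent-auth")

async def run_cleanup_scheduler(cleanup, interval_seconds: int = 60) -> None:
    """Call the given async cleanup function forever on a fixed interval.

    Errors are logged and swallowed so one failed pass never kills the loop.
    """
    while True:
        try:
            await cleanup()
        except Exception as e:
            logger.error(f"Cleanup pass failed: {e}")
        await asyncio.sleep(interval_seconds)
```

&lt;p&gt;In production you'd more likely hand &lt;code&gt;cleanup_expired_tasks&lt;/code&gt; to whatever scheduler you already run (cron, Celery beat, a Kubernetes CronJob); the important property is that a failed revocation is retried on the next pass rather than silently dropped.&lt;/p&gt;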



&lt;h3&gt;
  
  
  Audit trail for compliance
&lt;/h3&gt;

&lt;p&gt;Every authorization decision should be logged for audit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# "delegation_created", "access_checked", "access_denied", "task_revoked"
&lt;/span&gt;    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# "allowed", "denied"
&lt;/span&gt;    &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_audit_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log to your audit system of choice.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;audit_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
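&lt;p&gt;Wiring this in is a one-liner at each decision point. A hedged sketch of a builder for the access-check case (the dataclass mirrors &lt;code&gt;AuditEvent&lt;/code&gt; above; &lt;code&gt;audit_event_for_check&lt;/code&gt; is a helper name I'm introducing):&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AuditEvent:
    timestamp: datetime
    event_type: str
    user_id: str
    agent_id: str
    task_id: str
    resource_id: Optional[str]
    decision: str
    reason: str
    metadata: dict = field(default_factory=dict)

def audit_event_for_check(
    user_id: str,
    agent_id: str,
    task_id: str,
    resource_id: str,
    allowed: bool,
    reason: str,
) -> AuditEvent:
    """Build the audit event for a single authorization check result."""
    return AuditEvent(
        timestamp=datetime.now(timezone.utc),
        event_type="access_checked" if allowed else "access_denied",
        user_id=user_id,
        agent_id=agent_id,
        task_id=task_id,
        resource_id=resource_id,
        decision="allowed" if allowed else "denied",
        reason=reason,
    )
```

&lt;p&gt;Call it right after &lt;code&gt;check_agent_resource_access&lt;/code&gt; returns and pass the result to &lt;code&gt;log_audit_event&lt;/code&gt;, so every allow and deny leaves the same shaped record.&lt;/p&gt;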



&lt;h3&gt;
  
  
  Performance: caching authorization decisions
&lt;/h3&gt;

&lt;p&gt;Authorization checks happen on every tool call. For performance, implement caching with appropriate invalidation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CachedAuthorizationService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentAuthorizationService&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache_ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache_ttl_seconds&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_agent_resource_access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reader&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Check cache
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache miss - perform actual check
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;check_agent_resource_access&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_type&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cache the result (shorter TTL for denials)
&lt;/span&gt;        &lt;span class="n"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache_ttl&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_task_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Invalidate all cache entries for a task.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;keys_to_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys_to_delete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
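&lt;p&gt;One caveat with the bare dict above: expired entries are only skipped on read, never removed, so the cache grows without bound under churn. A minimal sketch of a size-capped TTL cache you could drop in instead (&lt;code&gt;TTLCache&lt;/code&gt; is my name for it, not a library class; in a multi-process deployment you'd reach for Redis or similar instead):&lt;/p&gt;

```python
from collections import OrderedDict
from datetime import datetime, timedelta, timezone

class TTLCache:
    """In-process TTL cache with a hard size cap.

    Expired entries are dropped on read; when the cap is hit,
    the oldest inserted entry is evicted to make room.
    """

    def __init__(self, max_entries: int = 10_000):
        self._data = OrderedDict()  # key -> (value, expires_at)
        self._max = max_entries

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if datetime.now(timezone.utc) >= expires_at:
            # Expired: remove eagerly so dead entries don't accumulate
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds: int) -> None:
        if key in self._data:
            del self._data[key]
        elif len(self._data) >= self._max:
            self._data.popitem(last=False)  # evict oldest insert
        expires_at = datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
        self._data[key] = (value, expires_at)
```

&lt;p&gt;The eviction policy here is insertion order for simplicity; an LRU would be a small change (&lt;code&gt;move_to_end&lt;/code&gt; on hit) if your check traffic is skewed toward a few hot resources.&lt;/p&gt;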






&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;This authorization model is a foundation, not a complete solution. Real-world deployments will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chain-of-thought authorization&lt;/strong&gt;: Validating not just individual tool calls, but sequences of calls that might create emergent risks. This requires pattern detection on action sequences—a topic for a future post.&lt;/p&gt;
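&lt;p&gt;To make that concrete, here's a deliberately crude sketch of what sequence-level detection looks like. The action names and risky pairs are invented for illustration; a real detector would match richer patterns than ordered pairs:&lt;/p&gt;

```python
# Hypothetical (earlier, later) action pairs that are risky in combination,
# even when each action is individually within scope.
RISKY_SEQUENCES = [
    ("read_secrets", "http_post"),       # exfiltration shape
    ("download_file", "execute_shell"),  # fetch-and-run shape
]

def flags_risky_sequence(action_log: list) -> list:
    """Return risky (earlier, later) pairs that occur in order in the log."""
    hits = []
    for first, second in RISKY_SEQUENCES:
        try:
            i = action_log.index(first)
        except ValueError:
            continue
        if second in action_log[i + 1:]:
            hits.append((first, second))
    return hits
```

&lt;p&gt;The gateway would run this over the task's accumulated action log before each new tool call and escalate (or hard-deny) when a pair matches.&lt;/p&gt;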

&lt;p&gt;&lt;strong&gt;User confirmation flows&lt;/strong&gt;: For high-risk operations, the authorization system should pause execution and request explicit user confirmation rather than auto-allowing based on inferred scopes.&lt;/p&gt;
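&lt;p&gt;The shape of such a gate is simple. A hedged sketch, assuming the host application supplies async &lt;code&gt;execute&lt;/code&gt; and &lt;code&gt;request_confirmation&lt;/code&gt; callables (the action names are invented; in practice the high-risk set would come from policy config, not a hardcoded constant):&lt;/p&gt;

```python
import asyncio

# Hypothetical high-risk action names for illustration.
HIGH_RISK_ACTIONS = {"delete_repo", "transfer_funds", "send_external_email"}

async def gated_execute(action: str, execute, request_confirmation) -> dict:
    """Run an action, pausing for explicit user approval when it is high-risk.

    execute(action) performs the tool call; request_confirmation(action)
    resolves to True/False once the user responds.
    """
    if action in HIGH_RISK_ACTIONS:
        approved = await request_confirmation(action)
        if not approved:
            return {"status": "denied", "reason": "user declined confirmation"}
    return await execute(action)
```

&lt;p&gt;The hard part in production isn't this branch; it's persisting the paused task so the agent can resume cleanly when the user answers minutes later.&lt;/p&gt;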

&lt;p&gt;&lt;strong&gt;Cross-agent authorization&lt;/strong&gt;: When agents delegate to other agents (increasingly common in multi-agent systems), the delegation chain needs to preserve authorization context and enforce attenuation—each hop can only reduce permissions, never expand them.&lt;/p&gt;
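&lt;p&gt;The attenuation invariant is easy to state in code. A minimal sketch (function names are mine, and real scopes would be structured objects rather than strings): each hop intersects the requested scopes with what the delegator actually holds, so the set can only shrink down the chain:&lt;/p&gt;

```python
def attenuate(parent_scopes: set, requested_scopes: set) -> set:
    """A delegation hop keeps only scopes the delegator already holds.

    Intersection guarantees permissions can shrink but never grow.
    """
    return parent_scopes.intersection(requested_scopes)

def delegation_chain_scopes(root_scopes: set, hops: list) -> set:
    """Fold attenuation down a chain of agent-to-agent delegations."""
    scopes = root_scopes
    for requested in hops:
        scopes = attenuate(scopes, requested)
    return scopes
```

&lt;p&gt;Because intersection is the only operation, a sub-agent three hops deep can never hold a scope the original user didn't grant, no matter what it requests.&lt;/p&gt;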

&lt;p&gt;&lt;strong&gt;Federated authorization&lt;/strong&gt;: As AI agents operate across organizational boundaries, we'll need standards for expressing and verifying delegated authority across trust domains.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;We're at an inflection point with agentic AI. The capabilities are advancing faster than the security models to contain them. Every week brings new frameworks for building agents, new tools for them to invoke, new integrations to connect—and almost none of it comes with authorization baked in.&lt;/p&gt;

&lt;p&gt;The patterns in this post aren't theoretical. They're the minimum viable security for any production agent system. The task-scoped delegation model, the authorization gateway, the audit trail—these should be table stakes, not advanced features.&lt;/p&gt;

&lt;p&gt;The good news: the building blocks exist. OpenFGA and similar ReBAC systems provide the authorization primitives. The patterns are portable across implementations. What's missing is adoption.&lt;/p&gt;

&lt;p&gt;If you're building agentic systems, I'd encourage you to implement authorization from day one. Retrofitting it later is painful, and the security risks of uncontrolled agents are too significant to defer.&lt;/p&gt;

&lt;p&gt;The age of autonomous AI systems is here. Let's make sure they operate within appropriate boundaries.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on &lt;a href="https://www.linkedin.com/in/siddhantkhare24" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/siddhant_K_code" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The code examples in this post are available on &lt;a href="https://github.com/Siddhant-K-code/agentic-authorization" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openfga</category>
      <category>security</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>Context Engineering: The critical Infrastructure challenge in production LLM systems</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Mon, 17 Nov 2025 17:03:08 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/context-engineering-the-critical-infrastructure-challenge-in-production-llm-systems-4id0</link>
      <guid>https://dev.to/siddhantkcode/context-engineering-the-critical-infrastructure-challenge-in-production-llm-systems-4id0</guid>
      <description>&lt;h2&gt;
  
  
  The $10M question nobody's asking
&lt;/h2&gt;

&lt;p&gt;While the industry obsesses over model parameters and training costs, we're collectively ignoring a production bottleneck that's costing organizations millions: &lt;strong&gt;inefficient context management&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;I recently analyzed production LLM deployments across multiple organizations and found something striking: &lt;strong&gt;65-80% of tokens sent to LLMs are redundant, irrelevant, or poorly structured&lt;/strong&gt;. When you're processing billions of tokens monthly at $0.01-0.06 per 1K tokens, this inefficiency translates to substantial operational waste, not just in dollars, but in latency, throughput, and user experience.&lt;/p&gt;

&lt;p&gt;Context engineering isn't just optimization; it's foundational infrastructure for production AI systems. And yet, most teams still treat it as an afterthought.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem: Context isn't just data
&lt;/h2&gt;

&lt;p&gt;The naive approach to LLM context looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What most teams do
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs/api.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs/examples.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_similar_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;get_conversation_history&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 🔥 Money burning
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fails in production for three critical reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;The token economics don't scale&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At enterprise scale, context inefficiency compounds fast. Consider a customer support system handling 100K requests per month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average context: 4,000 tokens (mostly redundant)&lt;/li&gt;
&lt;li&gt;Optimized context: 1,200 tokens (same information, higher density)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: 280M tokens/month = $16,800/month on GPT-4 alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiply this across multiple LLM endpoints, development environments, and experimentation workflows, and you're looking at six-figure annual waste.&lt;/p&gt;
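
&lt;p&gt;The arithmetic behind that savings line is worth making explicit. A quick sketch (assuming GPT-4-class pricing of $0.06 per 1K context tokens):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope token economics for the scenario above
requests = 100_000            # requests per billing period
baseline_tokens = 4_000       # naive context per request
optimized_tokens = 1_200      # engineered context per request
price_per_1k = 0.06           # assumed GPT-4-class price per 1K tokens

saved_tokens = (baseline_tokens - optimized_tokens) * requests
saved_dollars = saved_tokens / 1_000 * price_per_1k

print(saved_tokens)    # 280000000
print(saved_dollars)   # 16800.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;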

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Latency kills user experience&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every unnecessary token adds ~0.05-0.1ms to inference latency. In real-time applications (code completion, conversational AI, live analysis), this compounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4,000 token context: ~200-400ms baseline latency&lt;/li&gt;
&lt;li&gt;1,200 token context: ~60-120ms baseline latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: 2-3x faster time-to-first-token&lt;/li&gt;
&lt;/ul&gt;
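
&lt;p&gt;Those baseline numbers fall straight out of the per-token figure. A back-of-envelope helper (illustrative only; measured latency also depends on hardware, batching, and KV-cache behavior):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def prefill_latency_ms(context_tokens, per_token_ms=(0.05, 0.1)):
    """Rough prefill latency range implied by the per-token cost quoted
    above. Illustrative only; real systems vary with hardware and batching."""
    lo, hi = per_token_ms
    return context_tokens * lo, context_tokens * hi

print(prefill_latency_ms(4_000))   # (200.0, 400.0)
print(prefill_latency_ms(1_200))   # (60.0, 120.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;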

&lt;p&gt;In my research on GPU bottlenecks in LLM inference (published findings from my LLMTraceFX work), I found that &lt;strong&gt;memory bandwidth saturation accounts for 47-63% of inference latency&lt;/strong&gt;. Context bloat directly exacerbates this bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Information density matters more than volume&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's the counterintuitive insight: &lt;strong&gt;more context doesn't mean better results&lt;/strong&gt;. I ran controlled experiments comparing dense, relevant context versus exhaustive context dumps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Strategy&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Hallucination Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exhaustive dump&lt;/td&gt;
&lt;td&gt;8,000&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TF-IDF filtered&lt;/td&gt;
&lt;td&gt;2,400&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid optimized&lt;/td&gt;
&lt;td&gt;1,800&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model performs &lt;em&gt;better&lt;/em&gt; with less but higher-quality context. This aligns with recent research on attention dilution in long-context scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of Context Engineering
&lt;/h2&gt;

&lt;p&gt;After months of production experience and extensive research, I've developed a systematic approach to context engineering. This is the architecture that powers &lt;a href="https://github.com/Siddhant-K-code/ContextLab" rel="noopener noreferrer"&gt;ContextLab&lt;/a&gt;, an open-source toolkit I built to address these exact challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Intelligent Tokenization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not all tokenizers are created equal&lt;/strong&gt;. GPT-4 uses ~750 tokens for text that Claude processes in ~650 tokens. This 15% variance matters at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Multi-model tokenization analysis
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;analyze&lt;/span&gt;

&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs/*.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Cross-validate against target model
&lt;/span&gt;    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Optimal for most embedding models
&lt;/span&gt;    &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;            &lt;span class="c1"&gt;# Preserve semantic continuity
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Always tokenize using your target model's tokenizer. Pre-processing with a mismatched tokenizer can introduce 10-20% estimation errors.&lt;/p&gt;
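
&lt;p&gt;In practice, that means counting with the target model's own encoding. A minimal sketch using the &lt;code&gt;tiktoken&lt;/code&gt; library (the &lt;code&gt;count_tokens&lt;/code&gt; helper is mine, with a fallback encoding for models tiktoken doesn't recognize):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -&gt; int:
    """Count tokens with the target model's own encoding, falling back
    to a default BPE when the model is unknown to tiktoken."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

count_tokens("Context engineering is infrastructure.", model="gpt-4o-mini")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;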

&lt;h3&gt;
  
  
  Layer 2: Semantic Chunking
&lt;/h3&gt;

&lt;p&gt;Traditional fixed-size chunking breaks semantic boundaries. I implement content-aware chunking that respects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code boundaries&lt;/strong&gt;: Functions, classes, modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document structure&lt;/strong&gt;: Sections, paragraphs, lists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic coherence&lt;/strong&gt;: Measured via embedding similarity
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Semantic-aware chunking preserves context integrity
&lt;/span&gt;&lt;span class="err"&gt;┌─────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;Chunk&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Complete&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nf"&gt;validate_card&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maintains&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;semantics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nf"&gt;charge_amount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nf"&gt;send_receipt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────┘&lt;/span&gt;

&lt;span class="err"&gt;┌─────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="c1"&gt;## Error Handling       │  Chunk 2: Complete section
&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Our&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="n"&gt;implements&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preserves&lt;/span&gt; &lt;span class="n"&gt;documentation&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Retry&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt;           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Circuit&lt;/span&gt; &lt;span class="n"&gt;breakers&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
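
&lt;p&gt;For the code-boundary case, Python's standard &lt;code&gt;ast&lt;/code&gt; module is enough to sketch the idea. This is a toy top-level chunker, not ContextLab's implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def chunk_python_source(source: str):
    """Split a module into top-level chunks so no chunk cuts through
    the middle of a function or class body. Toy sketch only."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        start = node.lineno - 1          # ast line numbers are 1-based
        chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks

src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
chunk_python_source(src)   # two chunks, one per function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;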



&lt;h3&gt;
  
  
  Layer 3: Redundancy detection
&lt;/h3&gt;

&lt;p&gt;Production contexts often contain massive duplication: repeated examples, similar documentation sections, overlapping code snippets. I use &lt;strong&gt;embedding-based similarity detection&lt;/strong&gt; to identify redundant content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect near-duplicates via cosine similarity
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;detect_redundancy&lt;/span&gt;

&lt;span class="n"&gt;redundant_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_redundancy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;  &lt;span class="c1"&gt;# Cosine similarity cutoff
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Results: Found 234 redundant chunks (28% of corpus)
# Potential savings: 3,400 tokens per request
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Technical detail&lt;/strong&gt;: I compute embeddings using OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; (1536 dimensions), then use vectorized cosine similarity with NumPy for sub-millisecond performance on corpora of 10K+ chunks.&lt;/p&gt;
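
&lt;p&gt;The vectorized pass itself is compact. A condensed sketch of the idea (assuming non-zero embedding rows; not ContextLab's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def redundant_pairs(embeddings, threshold=0.85):
    """All-pairs cosine similarity via one matrix product; returns the
    index pairs above `threshold`. Sketch of the approach described above."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.triu(unit @ unit.T, k=1)   # upper triangle, no self-pairs
    i, j = np.where(sims &gt; threshold)
    return list(zip(i.tolist(), j.tolist()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;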

&lt;h3&gt;
  
  
  Layer 4: Salience scoring
&lt;/h3&gt;

&lt;p&gt;Not all content is equally valuable. I implement TF-IDF-inspired salience scoring to rank chunks by information density:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Score chunks by relevance to query
&lt;/span&gt;&lt;span class="n"&gt;salience_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_salience&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Semantic relevance
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uniqueness&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Inverse redundancy
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;        &lt;span class="c1"&gt;# Temporal relevance
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This multi-factor scoring enables intelligent pruning while preserving high-value context.&lt;/p&gt;
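
&lt;p&gt;A minimal version of that weighted combination (the weights mirror the snippet above; normalizing each factor to [0, 1] is left to the caller):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def salience(similarity, uniqueness, recency, weights=None):
    """Combine per-chunk factors (each assumed pre-scaled to [0, 1])
    into a single salience score. Illustrative sketch only."""
    w = weights or {"similarity": 0.6, "uniqueness": 0.2, "recency": 0.2}
    return (w["similarity"] * similarity
            + w["uniqueness"] * uniqueness
            + w["recency"] * recency)

salience(0.9, 0.4, 0.1)   # semantic relevance dominates the score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;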

&lt;h3&gt;
  
  
  Layer 5: Compression strategies
&lt;/h3&gt;

&lt;p&gt;ContextLab implements four core compression strategies, composable for hybrid optimization:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Deduplication&lt;/strong&gt; (Fast, Conservative)
&lt;/h4&gt;

&lt;p&gt;Remove near-duplicate chunks while preserving unique information. Best for documentation and knowledge bases with repetitive content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression ratio&lt;/strong&gt;: 1.2-1.8x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: &amp;lt;5ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information loss&lt;/strong&gt;: &amp;lt;2%&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Extractive Summarization&lt;/strong&gt; (Balanced)
&lt;/h4&gt;

&lt;p&gt;Select the most salient sentences from each chunk, maintaining original phrasing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression ratio&lt;/strong&gt;: 2-3x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: ~50ms per chunk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information loss&lt;/strong&gt;: 5-10%&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;LLM Summarization&lt;/strong&gt; (Aggressive, Expensive)
&lt;/h4&gt;

&lt;p&gt;Use a smaller model (e.g., GPT-4o-mini) to generate concise summaries.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression ratio&lt;/strong&gt;: 3-5x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: ~200ms per chunk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information loss&lt;/strong&gt;: 10-15%, but better semantic preservation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Sliding Window&lt;/strong&gt; (Temporal)
&lt;/h4&gt;

&lt;p&gt;Maintain only the N most recent chunks. Critical for conversational contexts with temporal relevance decay.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression ratio&lt;/strong&gt;: Configurable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency overhead&lt;/strong&gt;: ~1ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information loss&lt;/strong&gt;: Depends on window size&lt;/li&gt;
&lt;/ul&gt;
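
&lt;p&gt;A &lt;code&gt;deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt; captures the whole strategy in a few lines. This toy version evicts by chunk count; a production window would evict by token budget:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

class SlidingWindowContext:
    """Keep only the N most recent chunks (sketch of the temporal strategy)."""
    def __init__(self, max_chunks: int):
        self._chunks = deque(maxlen=max_chunks)

    def add(self, chunk: str) -&gt; None:
        self._chunks.append(chunk)   # oldest chunk drops automatically

    def to_context(self) -&gt; str:
        return "\n\n".join(self._chunks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;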

&lt;h3&gt;
  
  
  Layer 6: Budget optimization
&lt;/h3&gt;

&lt;p&gt;The final layer solves a constrained optimization problem: &lt;strong&gt;maximize information density under a token budget&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt;

&lt;span class="c1"&gt;# Greedy optimization with salience-based selection
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Target token budget
&lt;/span&gt;    &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hybrid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Combine multiple strategies
&lt;/span&gt;    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# Optimize for semantic relevance
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compressed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kept &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kept_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Salience score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_salience&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Algorithm&lt;/strong&gt;: I use a greedy knapsack approach with salience-weighted selection. For most workloads, this achieves 95%+ of optimal results with O(n log n) complexity versus O(2^n) for exhaustive search.&lt;/p&gt;
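
&lt;p&gt;The greedy pass amounts to ranking chunks by salience density (score per token) and filling the budget in that order. A simplified sketch, not ContextLab's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def greedy_select(chunks, budget):
    """chunks: list of (token_count, salience) pairs.
    Returns (kept indices, tokens used). Greedy density heuristic;
    near-optimal in practice, O(n log n) from the sort."""
    by_density = sorted(range(len(chunks)),
                        key=lambda i: chunks[i][1] / chunks[i][0],
                        reverse=True)
    kept, used = [], 0
    for i in by_density:
        tokens = chunks[i][0]
        if used + tokens &gt; budget:
            continue                 # chunk doesn't fit; try the next
        kept.append(i)
        used += tokens
    return sorted(kept), used

greedy_select([(100, 0.9), (200, 0.5), (50, 0.4)], budget=150)
# ([0, 2], 150)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;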

&lt;h2&gt;
  
  
  Observability: You Can't Optimize What You Can't Measure
&lt;/h2&gt;

&lt;p&gt;One of ContextLab's core innovations is comprehensive observability into context operations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Token timeline visualization
&lt;/h3&gt;

&lt;p&gt;Track how context evolves across compression stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original:  ████████████████████████ 12,400 tokens
Dedup:     ████████████████ 8,600 tokens (-31%)
Summarize: ██████████ 5,200 tokens (-40%)
Optimize:  ██████ 2,800 tokens (-46%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Embedding space analysis
&lt;/h3&gt;

&lt;p&gt;UMAP-reduced scatter plots reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster density&lt;/strong&gt;: Are chunks semantically diverse?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundancy patterns&lt;/strong&gt;: Visual identification of duplicates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage gaps&lt;/strong&gt;: Underrepresented topics in compressed context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Salience distribution
&lt;/h3&gt;

&lt;p&gt;Histogram analysis of chunk importance scores guides threshold tuning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Salience distribution (n=1,450 chunks):
0.0-0.2: ████ (180 chunks) - Low value, safe to drop
0.2-0.4: ████████ (420 chunks) - Medium value
0.4-0.6: ████████████ (580 chunks) - High value
0.6-0.8: ██████ (220 chunks) - Critical content
0.8-1.0: ██ (50 chunks) - Must-include chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-world impact: A case study
&lt;/h2&gt;

&lt;p&gt;I chatted with a team building an AI-powered code review system. Their initial implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context per review&lt;/strong&gt;: ~15,000 tokens (entire file + git diff + similar PRs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per review&lt;/strong&gt;: $0.90 (GPT-4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P95 latency&lt;/strong&gt;: 4.2 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily volume&lt;/strong&gt;: 2,000 reviews = $1,800/day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After implementing context engineering with ContextLab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimized context pipeline
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;changed_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Hybrid compression: dedup + extract + optimize
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hybrid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_relevance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;compressed_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_prompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context per review&lt;/strong&gt;: ~4,200 tokens (72% reduction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per review&lt;/strong&gt;: $0.25 (72% savings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P95 latency&lt;/strong&gt;: 1.8 seconds (57% faster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily savings&lt;/strong&gt;: $1,300 → &lt;strong&gt;$474,500/year&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, code review accuracy &lt;em&gt;improved&lt;/em&gt; from 76% to 83% because the model received higher-density, more relevant context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Context Engineering as Infrastructure
&lt;/h2&gt;

&lt;p&gt;Context engineering isn't a feature; it's &lt;strong&gt;foundational infrastructure&lt;/strong&gt; for production LLM systems. As we move toward increasingly complex agentic architectures, context management becomes even more critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trend 1: Multi-agent context coordination
&lt;/h3&gt;

&lt;p&gt;In multi-agent systems, context isn't just about individual requests; it's about &lt;strong&gt;shared state management&lt;/strong&gt; across autonomous agents. Future context engineering must handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context handoffs&lt;/strong&gt;: Efficiently transferring compressed state between agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical compression&lt;/strong&gt;: Different compression strategies for different agent tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict resolution&lt;/strong&gt;: Managing overlapping or contradictory context from multiple sources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trend 2: Real-Time adaptive compression
&lt;/h3&gt;

&lt;p&gt;Static compression strategies are suboptimal. I'm researching &lt;strong&gt;adaptive compression&lt;/strong&gt; that adjusts based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query characteristics&lt;/strong&gt;: Technical questions need different context than creative tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model capabilities&lt;/strong&gt;: Claude 3.5 handles longer contexts better than GPT-4o-mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency requirements&lt;/strong&gt;: Real-time systems prioritize speed over exhaustiveness&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trend 3: Context security &amp;amp; compliance
&lt;/h3&gt;

&lt;p&gt;As LLMs process sensitive data, context engineering must incorporate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII detection and redaction&lt;/strong&gt; during compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control&lt;/strong&gt; at the chunk level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trails&lt;/strong&gt; for context usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Differential privacy&lt;/strong&gt; guarantees on embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where my focus on &lt;strong&gt;agent infrastructure and security&lt;/strong&gt; becomes critical. Context engineering isn't just optimization; it's a security and compliance layer for production AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Call to Action: Build Context Intelligence into Your Stack
&lt;/h2&gt;

&lt;p&gt;If you're building with LLMs in production, here's my recommendation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 1: Measure
&lt;/h3&gt;

&lt;p&gt;Instrument your context pipeline. Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token counts per request (by model)&lt;/li&gt;
&lt;li&gt;Redundancy rates&lt;/li&gt;
&lt;li&gt;Compression ratios&lt;/li&gt;
&lt;li&gt;Cost per request&lt;/li&gt;
&lt;/ul&gt;
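
&lt;p&gt;Even a dict-per-request log gets you most of the way. A hypothetical helper (swap the in-memory list for your real metrics backend):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

TOKEN_LOG = []   # in-memory stand-in for a real metrics backend

def record_request(model, prompt_tokens, completion_tokens,
                   price_in_per_1k, price_out_per_1k):
    """Log volume and cost for one LLM call. Prices are USD per 1K
    tokens; hypothetical helper for the Week 1 instrumentation."""
    cost = (prompt_tokens / 1_000 * price_in_per_1k
            + completion_tokens / 1_000 * price_out_per_1k)
    TOKEN_LOG.append({
        "ts": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost,
    })
    return cost

record_request("gpt-4", 4_000, 500, 0.03, 0.06)   # 0.12 + 0.03 = 0.15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;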

&lt;h3&gt;
  
  
  Week 2: Analyze
&lt;/h3&gt;

&lt;p&gt;Run your production contexts through analysis tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;contextlab
contextlab analyze your_contexts/ &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o-mini &lt;span class="nt"&gt;--out&lt;/span&gt; .contextlab
contextlab viz .contextlab/&amp;lt;run_id&amp;gt;  &lt;span class="c"&gt;# Visualize results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Week 3: Optimize
&lt;/h3&gt;

&lt;p&gt;Implement compression strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start conservative (deduplication only)&lt;/li&gt;
&lt;li&gt;A/B test compressed vs. uncompressed contexts&lt;/li&gt;
&lt;li&gt;Measure accuracy, latency, and cost impact&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Week 4: Automate
&lt;/h3&gt;

&lt;p&gt;Build context engineering into your CI/CD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your LLM endpoint
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Automatic context optimization
&lt;/span&gt;    &lt;span class="n"&gt;optimized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_context&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Leave room for response
&lt;/span&gt;        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hybrid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_prompt&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Open Source and Community
&lt;/h2&gt;

&lt;p&gt;ContextLab is fully open source (MIT licensed) and designed for extensibility. The toolkit provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python SDK&lt;/strong&gt; for programmatic integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REST API&lt;/strong&gt; for language-agnostic usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI tools&lt;/strong&gt; for analysis and debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web dashboard&lt;/strong&gt; for visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built this independently to solve real production challenges, and I'm actively looking for collaborators and contributors. Whether you're optimizing costs, reducing latency, or researching context compression algorithms, this is infrastructure we all need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Siddhant-K-code/ContextLab" rel="noopener noreferrer"&gt;github.com/Siddhant-K-code/ContextLab&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Context engineering represents a fundamental shift in how we think about LLM infrastructure. It's not about prompt engineering; it's about &lt;strong&gt;information architecture for AI systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As models get larger and more capable, the constraint shifts from model intelligence to &lt;strong&gt;context quality&lt;/strong&gt;. Teams that master context engineering will have a significant competitive advantage: lower costs, faster systems, better accuracy, and stronger security.&lt;/p&gt;

&lt;p&gt;The tools are here. The methodologies are proven. The economics are compelling.&lt;/p&gt;

&lt;p&gt;The question is: will you continue burning tokens, or will you build intelligence into your context layer?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Connect on &lt;a href="https://www.linkedin.com/in/siddhantkhare24/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://github.com/Siddhant-K-code" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://x.com/Siddhant_K_code" rel="noopener noreferrer"&gt;X/Twitter&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in collaborating on context engineering research or contributing to ContextLab? DM me on &lt;a href="https://x.com/Siddhant_K_code" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>software</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AWS S3 Vectors at scale: Real performance numbers at 10 million Vectors</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Thu, 06 Nov 2025 10:42:37 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/aws-s3-vectors-at-scale-real-performance-numbers-at-10-million-vectors-2lno</link>
      <guid>https://dev.to/siddhantkcode/aws-s3-vectors-at-scale-real-performance-numbers-at-10-million-vectors-2lno</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/features/vectors/" rel="noopener noreferrer"&gt;AWS S3 Vectors&lt;/a&gt; promises "billions of vectors with sub-second queries" and up to 90% cost savings over traditional vector databases. These claims sound good on paper, but implementation details matter. How does performance actually scale? What's the accuracy trade-off? Are there operational gotchas?&lt;/p&gt;

&lt;p&gt;This post presents empirical benchmarks testing S3 Vectors from 10,000 to 10 million vectors, comparing performance and accuracy against FAISS and NMSLib. All code used boto3 on us-east-1, measuring real-world query latency including network overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is S3 Vectors?
&lt;/h2&gt;

&lt;p&gt;S3 Vectors is AWS's managed vector search service that stores and queries vector embeddings directly in S3. Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native S3 integration with standard durability/availability guarantees&lt;/li&gt;
&lt;li&gt;Maximum 50 million vectors per index&lt;/li&gt;
&lt;li&gt;Maximum 4096 dimensions per vector&lt;/li&gt;
&lt;li&gt;Supports cosine similarity and euclidean distance&lt;/li&gt;
&lt;li&gt;Accessed via boto3 &lt;code&gt;query_vectors&lt;/code&gt; API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The value proposition is operational simplicity and cost reduction. You don't manage infrastructure, handle index building, or worry about scaling - you just store vectors in S3 and query them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimental Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Primary dataset: UKBench&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10,200 images containing 2,550 distinct objects (4 images per object)&lt;/li&gt;
&lt;li&gt;Used for both queries and database (search should return same object images)&lt;/li&gt;
&lt;li&gt;Metric: Recall@4 - percentage of correct object images in top 4 results&lt;/li&gt;
&lt;li&gt;Since query images exist in database, top result is always the query itself&lt;/li&gt;
&lt;/ul&gt;
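&lt;p&gt;For concreteness, Recall@4 reduces to a few lines (an illustrative helper, not code from the benchmark repo):&lt;/p&gt;

```python
def recall_at_4(results_top4, relevant_ids):
    """Fraction of the 4 same-object images present in the top-4 results."""
    hits = sum(1 for r in results_top4[:4] if r in relevant_ids)
    return hits / 4.0

# Query image 0 belongs to object {0, 1, 2, 3}; the search found 3 of the 4.
score = recall_at_4([0, 1, 2, 9], {0, 1, 2, 3})
print(score)  # 0.75
```

&lt;p&gt;The reported numbers are this per-query score averaged over all 10,200 queries.&lt;/p&gt;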

&lt;p&gt;&lt;strong&gt;Distractor dataset: Microsoft COCO 2017&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random crops used to scale database to 10M vectors&lt;/li&gt;
&lt;li&gt;Provides realistic noise for large-scale testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Vector Embeddings
&lt;/h3&gt;

&lt;p&gt;DINOv3 (self-supervised vision transformer) for image embeddings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Vector Dimensions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ViT-S/16 distilled&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ViT-B/16 distilled&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ViT-L/16 distilled&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Chose DINOv3 for strong performance on image retrieval tasks without fine-tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3 Vectors&lt;/strong&gt;: us-east-1 bucket, queries from CloudShell (same region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local baseline&lt;/strong&gt;: Intel Core i7-13700KF (16c/24t), 32GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement&lt;/strong&gt;: Query time from sending vector via &lt;code&gt;query_vectors&lt;/code&gt; to receiving results

&lt;ul&gt;
&lt;li&gt;Does NOT include embedding generation time&lt;/li&gt;
&lt;li&gt;DOES include network latency and API overhead&lt;/li&gt;
&lt;li&gt;Measured per individual query (not batched)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Comparison Methods
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FAISS (Facebook AI Similarity Search)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IndexHNSWFlat with m=32, efConstruction=512&lt;/li&gt;
&lt;li&gt;Graph-based approximate nearest neighbor search&lt;/li&gt;
&lt;li&gt;Run locally (no network overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NMSLib (Non-Metric Space Library)&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HNSW method with default parameters&lt;/li&gt;
&lt;li&gt;Another HNSW implementation for comparison&lt;/li&gt;
&lt;li&gt;Run locally (no network overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Brute-force search&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy inner product (the &lt;code&gt;@&lt;/code&gt; operator) computed per query&lt;/li&gt;
&lt;li&gt;True nearest neighbors (100% recall baseline)&lt;/li&gt;
&lt;li&gt;Run locally (no network overhead)&lt;/li&gt;
&lt;/ul&gt;
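&lt;p&gt;The brute-force baseline amounts to an exact inner-product top-k. A dependency-free sketch of the same computation (the actual baseline uses a vectorized NumPy matrix product, which is what makes it feasible at scale):&lt;/p&gt;

```python
def brute_force_topk(query, database, k):
    """Exact nearest neighbors by inner product; 100% recall by construction."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Score every database vector, then keep the k highest-scoring indices.
    scored = sorted(enumerate(database), key=lambda iv: dot(query, iv[1]), reverse=True)
    return [idx for idx, _ in scored[:k]]

db = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_topk([1.0, 0.1], db, 2))  # [0, 2]
```

&lt;p&gt;Every ANN method in the tables is measured against the neighbors this exhaustive scan returns.&lt;/p&gt;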

&lt;p&gt;Important caveat: Local execution eliminates network latency, giving FAISS/NMSLib inherent speed advantages unrelated to algorithm quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Scaling Vector Count
&lt;/h2&gt;

&lt;p&gt;Testing from 10K to 10M vectors with 384-dimensional embeddings, topK=5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vectors&lt;/th&gt;
&lt;th&gt;Query Time (ms)&lt;/th&gt;
&lt;th&gt;Recall@4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10,200&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;0.968&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;td&gt;0.969&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;207&lt;/td&gt;
&lt;td&gt;0.969&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000,000&lt;/td&gt;
&lt;td&gt;382&lt;/td&gt;
&lt;td&gt;0.908&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Absolute Processing Time
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/image3" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/image3" alt="Processing Time Chart" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;S3 Vectors query time grows from 112ms at 10K vectors to 382ms at 10M vectors - a 3.4x increase for a 1000x data increase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Observations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query latency scales sublinearly&lt;/strong&gt;: Moving from 10K to 10M vectors (1000x increase) results in only 3.4x latency increase. This suggests efficient indexing that doesn't degrade linearly with dataset size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-second queries achieved&lt;/strong&gt;: At 10M vectors, queries complete in 382ms. AWS's "sub-second" claim holds at this scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy remains strong&lt;/strong&gt;: Recall@4 stays above 90% even at 10M scale. The drop from 0.97 to 0.91 indicates some accuracy trade-off with scale, but still delivers relevant results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed overhead dominates at small scale&lt;/strong&gt;: The 112ms baseline at 10K vectors includes network/API overhead. This makes S3 Vectors less competitive for small datasets where local search would be faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: S3 Vectors vs Alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Absolute Query Times
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vectors&lt;/th&gt;
&lt;th&gt;FAISS (local)&lt;/th&gt;
&lt;th&gt;NMSLib (local)&lt;/th&gt;
&lt;th&gt;S3 Vectors&lt;/th&gt;
&lt;th&gt;Brute-force (local)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10,200&lt;/td&gt;
&lt;td&gt;0.03 ms&lt;/td&gt;
&lt;td&gt;0.02 ms&lt;/td&gt;
&lt;td&gt;112 ms&lt;/td&gt;
&lt;td&gt;0.05 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;0.06 ms&lt;/td&gt;
&lt;td&gt;0.03 ms&lt;/td&gt;
&lt;td&gt;137 ms&lt;/td&gt;
&lt;td&gt;2.78 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;0.10 ms&lt;/td&gt;
&lt;td&gt;0.05 ms&lt;/td&gt;
&lt;td&gt;207 ms&lt;/td&gt;
&lt;td&gt;25.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000,000&lt;/td&gt;
&lt;td&gt;0.27 ms&lt;/td&gt;
&lt;td&gt;0.09 ms&lt;/td&gt;
&lt;td&gt;382 ms&lt;/td&gt;
&lt;td&gt;381 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Local execution is orders of magnitude faster because it avoids network overhead entirely. However, this ignores infrastructure costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Time Ratio (Normalized to 10K baseline)
&lt;/h3&gt;

&lt;p&gt;To understand scaling behavior independent of fixed costs, normalize each method's 10K time to 1.0:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vectors&lt;/th&gt;
&lt;th&gt;FAISS&lt;/th&gt;
&lt;th&gt;NMSLib&lt;/th&gt;
&lt;th&gt;S3 Vectors&lt;/th&gt;
&lt;th&gt;Brute-force&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10,200&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;2.7x&lt;/td&gt;
&lt;td&gt;2.4x&lt;/td&gt;
&lt;td&gt;1.8x&lt;/td&gt;
&lt;td&gt;512x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000,000&lt;/td&gt;
&lt;td&gt;8.1x&lt;/td&gt;
&lt;td&gt;5.1x&lt;/td&gt;
&lt;td&gt;3.4x&lt;/td&gt;
&lt;td&gt;7620x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;S3 Vectors scales better than FAISS/NMSLib&lt;/strong&gt; when normalized. This is surprising and suggests AWS's indexing approach handles growth efficiently.&lt;/p&gt;

&lt;p&gt;Note: This comparison has limitations. Different HNSW parameters would change FAISS/NMSLib results. The key takeaway is that S3 Vectors' scaling characteristics are competitive with established ANN libraries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vectors&lt;/th&gt;
&lt;th&gt;FAISS&lt;/th&gt;
&lt;th&gt;NMSLib&lt;/th&gt;
&lt;th&gt;S3 Vectors&lt;/th&gt;
&lt;th&gt;Brute-force&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10,200&lt;/td&gt;
&lt;td&gt;0.970&lt;/td&gt;
&lt;td&gt;0.950&lt;/td&gt;
&lt;td&gt;0.968&lt;/td&gt;
&lt;td&gt;0.970&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;0.970&lt;/td&gt;
&lt;td&gt;0.930&lt;/td&gt;
&lt;td&gt;0.969&lt;/td&gt;
&lt;td&gt;0.970&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000,000&lt;/td&gt;
&lt;td&gt;0.910&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.908&lt;/td&gt;
&lt;td&gt;0.970&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 10M scale, S3 Vectors matches FAISS accuracy and significantly outperforms NMSLib (though this is likely due to parameter tuning differences rather than fundamental algorithm quality).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy degrades with scale for all ANN methods&lt;/strong&gt;. This is expected - approximate search trades some accuracy for speed. The degradation rate for S3 Vectors is comparable to tuned FAISS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact of Vector Dimensionality
&lt;/h2&gt;

&lt;p&gt;Testing dimension scaling with 100K vectors:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Query Time (ms)&lt;/th&gt;
&lt;th&gt;Recall@4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;151&lt;/td&gt;
&lt;td&gt;0.983&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;158&lt;/td&gt;
&lt;td&gt;0.988&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;215&lt;/td&gt;
&lt;td&gt;0.988&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Dimension scaling is gentle&lt;/strong&gt;: Going from 384 to 4096 dimensions (10.7x increase) adds only 57% latency. Higher dimensional vectors capture more information, improving accuracy with modest performance cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality reduction likely unnecessary&lt;/strong&gt;: The small performance gain from reducing dimensions probably isn't worth the accuracy loss for most use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. topK Returns K-1 Results Intermittently
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: The &lt;code&gt;topK&lt;/code&gt; parameter specifies how many results to return, but approximately 20% of queries return K-1 results instead of K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Details&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No reproducible pattern&lt;/li&gt;
&lt;li&gt;Occurs across different K values
&lt;/li&gt;
&lt;li&gt;Same query returns different result counts on repeated execution&lt;/li&gt;
&lt;li&gt;No documented explanation in AWS docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Applications must handle variable result counts; you cannot assume exactly topK results will be returned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround&lt;/strong&gt;: Request topK+1 if exactly K results are required, though this doesn't guarantee K results either.&lt;/p&gt;
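&lt;p&gt;One defensive pattern is to re-query with a progressively inflated topK until enough results arrive. A sketch with a stubbed query function (&lt;code&gt;query_fn&lt;/code&gt; stands in for your &lt;code&gt;query_vectors&lt;/code&gt; wrapper; the retry loop is my own workaround, not AWS guidance):&lt;/p&gt;

```python
def query_with_min_results(query_fn, vector, k, max_attempts=3):
    """Work around intermittent K-1 results by re-querying with a larger topK."""
    results = []
    for extra in range(max_attempts):
        results = query_fn(vector, top_k=k + extra)
        if min(len(results), k) == k:  # i.e. we got at least k results back
            return results[:k]
    return results  # may still be short; callers must tolerate that

def flaky_query(vector, top_k):
    return list(range(top_k - 1))  # simulates the bug: always one result short

print(len(query_with_min_results(flaky_query, [0.0], 5)))  # 5
```

&lt;p&gt;Even with retries, the final return can come back short, so downstream code still needs to tolerate fewer than K results.&lt;/p&gt;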

&lt;h3&gt;
  
  
  2. Vector Deletion is Extremely Slow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Measurement&lt;/strong&gt;: &lt;code&gt;delete_vectors&lt;/code&gt; processes 3-4 vectors per second via boto3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparison&lt;/strong&gt;: &lt;code&gt;put_data&lt;/code&gt; inserts ~500 vectors per second (over 100x faster).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Deleting large numbers of vectors is impractical. For 10M vectors, deletion would take ~30 days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: For bulk deletion, recreate the vector index rather than delete individual vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Vector Ingestion at Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rate&lt;/strong&gt;: &lt;code&gt;put_data&lt;/code&gt; accepts maximum 500 vectors per call, completing in ~1 second for low dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At 10M scale&lt;/strong&gt;: Full ingestion takes approximately 5-6 hours with 384-dim vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension impact&lt;/strong&gt;: At 4096 dimensions, 500-vector batches sometimes fail, suggesting payload size limits. Reduce batch size for high-dimensional vectors.&lt;/p&gt;
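&lt;p&gt;A small batching helper can shrink batch size for high-dimensional vectors. The 500-vector cap and the failures near 4096 dimensions come from the observations above; the halving threshold itself is an assumption, not documented AWS behavior:&lt;/p&gt;

```python
def batches(vectors, dim, max_batch=500):
    """Yield insert batches, shrinking batch size for high-dimensional vectors."""
    # Halve the batch once dimensions pass ~2048 to stay under payload limits.
    size = max_batch if dim // 2048 == 0 else max_batch // 2
    for start in range(0, len(vectors), size):
        yield vectors[start:start + size]

vecs = [[0.0] * 4096 for _ in range(1200)]
print([len(b) for b in batches(vecs, 4096)])  # [250, 250, 250, 250, 200]
```

&lt;p&gt;In practice you would tune the threshold empirically against the batch failures you observe.&lt;/p&gt;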

&lt;h3&gt;
  
  
  4. Indexing Appears Incremental
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Observation&lt;/strong&gt;: Queries return results immediately after inserting vectors, even during ongoing bulk inserts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: S3 Vectors likely builds/updates indexes during insertion rather than requiring a separate indexing phase. This differs from traditional vector databases that build indexes after bulk load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantage&lt;/strong&gt;: No downtime waiting for index construction. New vectors become searchable quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use S3 Vectors
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Good Fit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost-sensitive applications&lt;/strong&gt;: 90% cost savings over dedicated vector DBs adds up at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Moderate latency requirements&lt;/strong&gt;: 100-500ms query latency is acceptable for many applications (semantic search, recommendation systems, content discovery).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational simplicity priority&lt;/strong&gt;: No infrastructure to manage, automatic scaling, S3's durability guarantees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Growing datasets&lt;/strong&gt;: Sublinear scaling means performance stays reasonable as data grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration with AWS services&lt;/strong&gt;: Native S3 storage works well with Lambda, Bedrock, SageMaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Poor Fit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ultra-low latency requirements&lt;/strong&gt;: If you need &amp;lt;10ms queries, local FAISS/NMSLib will outperform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small datasets&lt;/strong&gt;: Network overhead dominates at small scale. Local search is faster for &amp;lt;100K vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frequent bulk deletions&lt;/strong&gt;: Deletion performance makes this operationally painful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact nearest neighbors required&lt;/strong&gt;: ANN trade-offs mean 90-97% recall, not 100%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extremely large scale (&amp;gt;50M per index)&lt;/strong&gt;: Requires multiple indexes and custom orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with S3 Vectors for new projects&lt;/strong&gt;: Unless you have proven low-latency requirements, the operational benefits outweigh performance differences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor the topK bug&lt;/strong&gt;: Build result count validation into your application logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design for immutable vectors&lt;/strong&gt;: Given slow deletion, treat vectors as append-only when possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch queries if possible&lt;/strong&gt;: While this benchmark tested single queries, batching multiple queries per API call would amortize network overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test with your data&lt;/strong&gt;: Accuracy and performance depend on vector characteristics. Run your own benchmarks with representative data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan for multi-index if scaling beyond 50M&lt;/strong&gt;: Design shard-aware query distribution early.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
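&lt;p&gt;On recommendation 4: even without server-side batching, client-side concurrency amortizes per-query latency in a similar way. A sketch with a simulated query (&lt;code&gt;fake_query&lt;/code&gt; is a stand-in for the real API call, not part of it):&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_queries(query_fn, vectors, workers=8):
    """Issue independent queries concurrently so network round trips overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(query_fn, vectors))

def fake_query(vector):
    time.sleep(0.05)  # stand-in for the 100-400ms network round trip
    return [sum(vector)]

# 16 queries at 50ms each finish in roughly two waves of 8, not 800ms serially.
out = parallel_queries(fake_query, [[float(i)] for i in range(16)])
print(len(out))  # 16
```

&lt;p&gt;Watch API rate limits before raising &lt;code&gt;workers&lt;/code&gt; aggressively; the benchmark numbers above are all single-query.&lt;/p&gt;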

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;S3 Vectors delivers on its core promise: you can query 10 million vectors in under 400ms with ~91% recall, and costs are significantly lower than dedicated vector databases.&lt;/p&gt;

&lt;p&gt;The sublinear scaling characteristics are impressive - performance degrades gracefully as datasets grow. Accuracy remains competitive with tuned FAISS at scale.&lt;/p&gt;

&lt;p&gt;However, operational quirks exist: the topK bug needs workarounds, deletion is impractically slow, and small datasets don't benefit from the service.&lt;/p&gt;

&lt;p&gt;For most ML applications where 100-500ms latency is acceptable and you value operational simplicity over raw speed, S3 Vectors is a strong default choice. The "cheap managed alternative" has become a legitimate first-class option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;All measurements represent single-query latency (no batching)&lt;/li&gt;
&lt;li&gt;Query times include network and API overhead for S3 Vectors&lt;/li&gt;
&lt;li&gt;Local methods (FAISS/NMSLib/brute-force) exclude network overhead&lt;/li&gt;
&lt;li&gt;Each data point represents average across all 10,200 UKBench queries&lt;/li&gt;
&lt;li&gt;HNSW parameters chosen for reasonable defaults, not exhaustive tuning&lt;/li&gt;
&lt;li&gt;Code available at &lt;a href="https://github.com/Siddhant-K-code/s3-vectors-benchmark" rel="noopener noreferrer"&gt;https://github.com/Siddhant-K-code/s3-vectors-benchmark&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>aws</category>
    </item>
    <item>
      <title>Why agent orchestration is harder than kubernetes - Lessons while building Agentflow</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Thu, 23 Oct 2025 16:15:12 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/why-agent-orchestration-is-harder-than-kubernetes-lessons-while-building-agentflow-4jm3</link>
      <guid>https://dev.to/siddhantkcode/why-agent-orchestration-is-harder-than-kubernetes-lessons-while-building-agentflow-4jm3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; While building &lt;a href="https://github.com/Siddhant-K-code/agentflow" rel="noopener noreferrer"&gt;AgentFlow&lt;/a&gt;, an open source orchestration engine for AI agents, I discovered fundamental differences from container orchestration. Kubernetes assumes deterministic workloads; agents are non-deterministic reasoning systems. This post explores the architectural challenges I identified and the design decisions I made to address them.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt; AgentFlow is a personal side project built to explore agent orchestration challenges. The observations and technical decisions in this post reflect my individual learning and experimentation, and do not represent the views, products, or architecture of my employer. All code examples are from the open source AgentFlow project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction: The orchestration illusion
&lt;/h2&gt;

&lt;p&gt;When I started building AgentFlow, the pitch was simple: "Kubernetes for AI agents." The analogy made sense: both systems schedule workloads, manage resources, and handle failures. One month into building the initial version, I learned why that comparison falls apart.&lt;/p&gt;

&lt;p&gt;Kubernetes assumes your workload is a deterministic function: same input → same output. Containers crash cleanly. Resource needs are predictable. State is either ephemeral or in a database.&lt;/p&gt;

&lt;p&gt;Agents break every assumption:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic execution:&lt;/strong&gt; Same prompt generates different responses (temperature, model updates, context window variations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous failures:&lt;/strong&gt; Agent produces output, but is it &lt;em&gt;correct&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed state:&lt;/strong&gt; Reasoning context, tool outputs, external API mutations, LLM chat history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic resource needs:&lt;/strong&gt; Token quotas, model availability, cost constraints, latency requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive decomposition:&lt;/strong&gt; Agent spawns sub-agents at runtime based on task complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post explores why these differences make agent orchestration an order of magnitude harder, with concrete examples from production systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The non-determinism problem: When retry isn't idempotent
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes assumption
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pod fails → K8s restarts it&lt;/span&gt;
&lt;span class="c1"&gt;# Same image + same config = same behavior&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth-service&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth:v1.2.3&lt;/span&gt;
    &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart is safe because containers are deterministic. Same inputs → same outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent reality
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent task: "Refactor database query for performance"
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Optimize this SQL query for performance:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{query}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;  &lt;span class="c1"&gt;# Non-zero temperature = non-deterministic
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Attempt 1: Adds index on user_id
# Attempt 2: Rewrites as JOIN instead of subquery  
# Attempt 3: Suggests denormalization
&lt;/span&gt;
&lt;span class="c1"&gt;# Which is "correct"? All could work. Or none.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implication:&lt;/strong&gt; Retry logic is ambiguous:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should we retry with same prompt? (might get worse output)&lt;/li&gt;
&lt;li&gt;Different temperature? (changes behavior profile)&lt;/li&gt;
&lt;li&gt;Different model? (GPT-5 vs Sonnet-4.5 - different reasoning styles)&lt;/li&gt;
&lt;li&gt;Add few-shot examples from previous attempts? (context pollution)&lt;/li&gt;
&lt;/ul&gt;
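&lt;p&gt;One way to make that retry decision explicit is an escalation ladder: same config first, then a deterministic retry, then a different model. A sketch with a stubbed LLM call (the ladder, model names, and validator are illustrative, not AgentFlow's actual policy):&lt;/p&gt;

```python
def retry_with_escalation(generate, prompt, validate):
    """Try progressively more conservative configs until the output validates."""
    ladder = [
        {"temperature": 0.7, "model": "model-a"},  # original config
        {"temperature": 0.0, "model": "model-a"},  # deterministic retry
        {"temperature": 0.0, "model": "model-b"},  # different reasoning style
    ]
    for config in ladder:
        output = generate(prompt, **config)
        if validate(output):
            return output, config
    raise RuntimeError("all retry strategies exhausted")

# Stub: only deterministic sampling produces output that passes validation.
def stub_generate(prompt, temperature, model):
    return "ok" if temperature == 0.0 else "garbled"

result, used = retry_with_escalation(stub_generate, "optimize query", lambda o: o == "ok")
print(result, used["temperature"])  # ok 0.0
```

&lt;p&gt;The hard part in practice is &lt;code&gt;validate&lt;/code&gt;: for agents, deciding whether output is correct is exactly the ambiguous-failure problem discussed next.&lt;/p&gt;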

&lt;h3&gt;
  
  
  Real production failure
&lt;/h3&gt;

&lt;p&gt;I had an agent that wrote Terraform configurations. On retry after timeout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First attempt: Created 90% of infrastructure&lt;/li&gt;
&lt;li&gt;Retry: Generated &lt;em&gt;different&lt;/em&gt; resource names&lt;/li&gt;
&lt;li&gt;Result: Duplicate infrastructure, half-configured state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;K8s equivalent would be:&lt;/strong&gt; Pod restarts and creates new database tables with different schemas each time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our solution: Semantic checkpointing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AgentCheckpoint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Uuid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReasoningStep&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// What agent decided and why&lt;/span&gt;
    &lt;span class="n"&gt;tool_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// External state mutations&lt;/span&gt;
    &lt;span class="n"&gt;partial_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Artifact&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Code, configs, etc.&lt;/span&gt;
    &lt;span class="n"&gt;context_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Hash of prompt + context for replay detection&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// On failure, we:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Load checkpoint&lt;/span&gt;
&lt;span class="c1"&gt;// 2. Replay tool outputs (don't re-execute)&lt;/span&gt;
&lt;span class="c1"&gt;// 3. Resume with explicit "continue from step N" prompt&lt;/span&gt;
&lt;span class="c1"&gt;// 4. Compare context_hash to detect if inputs changed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key insight: &lt;strong&gt;Checkpoints must capture intent and reasoning, not just state.&lt;/strong&gt; When resuming, the agent needs to understand what it was &lt;em&gt;trying&lt;/em&gt; to do, not just what it did.&lt;/p&gt;
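&lt;p&gt;A minimal sketch of that resume path in Python. The fields mirror the Rust struct above, but the helpers and the exact hash scheme are illustrative, not our production code:&lt;/p&gt;

```python
import hashlib
from dataclasses import dataclass

@dataclass
class AgentCheckpoint:
    task_id: str
    reasoning_trace: list   # what the agent decided and why
    tool_outputs: dict      # external state mutations, keyed by call id
    partial_results: list   # code, configs, etc.
    context_hash: str       # hash of prompt + context at checkpoint time

def context_hash(prompt: str, context: str) -> str:
    return hashlib.sha256((prompt + "\x00" + context).encode()).hexdigest()

def resume(ckpt: AgentCheckpoint, prompt: str, context: str) -> dict:
    """Decide how to resume a failed run from a checkpoint."""
    if context_hash(prompt, context) != ckpt.context_hash:
        # Inputs changed since the checkpoint: a blind resume would mix
        # stale reasoning with new context, so start over.
        return {"action": "restart", "reuse": []}
    # Inputs unchanged: replay recorded tool outputs instead of
    # re-executing them, and tell the agent explicitly where it left off.
    step = len(ckpt.reasoning_trace)
    return {
        "action": "resume",
        "reuse": list(ckpt.tool_outputs),
        "prompt_suffix": f"Continue from step {step}. Do not repeat completed steps.",
    }
```

&lt;p&gt;The hash comparison is what prevents the Terraform-style failure above: if the inputs drifted between attempts, resuming from stale reasoning is worse than restarting.&lt;/p&gt;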




&lt;h2&gt;
  
  
  2. Failure detection: when "success" is ambiguous
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes failure modes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Binary outcome: process alive/dead, port open/closed, HTTP 200/500.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent failure modes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent task: "Write unit tests for auth module"
&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent returns HTTP 200 with:
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
def test_login():
    user = User(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    assert user.login() == True  # Useless test
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Questions:
# - Is this a "failure"? Code is syntactically valid
# - Test doesn't actually validate auth logic
# - How do we detect this programmatically?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure categories I've encountered:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Syntactic failure:&lt;/strong&gt; Invalid code, malformed JSON (easy to detect)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic failure:&lt;/strong&gt; Valid code that doesn't solve the task (hard)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial completion:&lt;/strong&gt; 70% correct, 30% missing (do I retry the whole task?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated success:&lt;/strong&gt; Agent claims completion but didn't execute tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent degradation:&lt;/strong&gt; Output works but is suboptimal (performance, security)&lt;/li&gt;
&lt;/ol&gt;
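&lt;p&gt;Of these, only the first and fourth are mechanically checkable from the response alone. A hedged Python sketch of that first-pass split (assumes the output is Python code; the inputs and thresholds are illustrative):&lt;/p&gt;

```python
import ast
from enum import Enum

class FailureKind(Enum):
    SYNTACTIC = "syntactic"                 # invalid code / malformed output
    HALLUCINATED_SUCCESS = "hallucinated"   # claims done, but no tools ran
    UNCERTAIN = "uncertain"                 # needs tests, evals, or outcomes

def classify(output: str, tool_calls_made: int, claims_completion: bool) -> FailureKind:
    # 1. Syntactic failure: cheap and reliable to detect.
    try:
        ast.parse(output)
    except SyntaxError:
        return FailureKind.SYNTACTIC
    # 4. Hallucinated success: the agent says "done" but never executed a tool.
    if claims_completion and tool_calls_made == 0:
        return FailureKind.HALLUCINATED_SUCCESS
    # 2/3/5 (semantic, partial, silent degradation) cannot be decided
    # from the text alone -- they fall through to deeper validation.
    return FailureKind.UNCERTAIN
```

&lt;p&gt;Everything that falls through to &lt;code&gt;UNCERTAIN&lt;/code&gt; is what the detection strategies below are for.&lt;/p&gt;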

&lt;h3&gt;
  
  
  Detection strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Programmatic validation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;ValidationResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Pass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;Fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Uncertain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Can't determine programmatically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Validator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;validate_code_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ValidationResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Syntax check (AST parsing)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_syntax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Syntax error: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Execute tests&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Tests failed: {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Static analysis&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_linters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="py"&gt;.critical&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;Fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Critical issues: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. Semantic validation - HARD&lt;/span&gt;
        &lt;span class="c1"&gt;// How do we know if tests actually validate auth logic?&lt;/span&gt;
        &lt;span class="nn"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Uncertain&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Approach 2: Evaluation agents (agent-as-judge)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Separate agent evaluates output quality
&lt;/span&gt;&lt;span class="n"&gt;evaluation_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Does this output successfully complete the task?
Score 0-10 and explain your reasoning.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluation_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Retry or escalate to human
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; The evaluation agent can also hallucinate. I found a 15% false-positive rate (bad output marked as good).&lt;/p&gt;
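&lt;p&gt;One mitigation worth sketching: don't trust a single judgment. Sample the evaluator several times and require agreement before accepting. The shape below is a sketch under assumptions: &lt;code&gt;evaluate&lt;/code&gt; stands in for a sampled evaluator call, and the thresholds are illustrative (7/10 matches the snippet above):&lt;/p&gt;

```python
from statistics import median

def evaluate_with_consensus(evaluate, n_samples=3, threshold=7, spread=3):
    """Score an output n times; accept only on agreeing, high scores.

    `evaluate` is any zero-arg callable returning a 0-10 score,
    e.g. one sampled call to an evaluator agent.
    """
    scores = [evaluate() for _ in range(n_samples)]
    if max(scores) - min(scores) > spread:
        return "escalate"          # judges disagree: send to a human
    if median(scores) >= threshold:
        return "accept"
    return "retry"
```

&lt;p&gt;This trades extra evaluation tokens for fewer false positives, and routes the genuinely ambiguous cases to a human instead of guessing.&lt;/p&gt;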

&lt;p&gt;&lt;strong&gt;Approach 3: Outcome-based validation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Don't validate the code, validate if it works end-to-end&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;validate_deployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Agent wrote deployment config&lt;/span&gt;
    &lt;span class="c1"&gt;// Actually deploy to staging&lt;/span&gt;
    &lt;span class="nf"&gt;deploy_to_staging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Run integration tests&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_service_health&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Monitor for 5 minutes&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;observe_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="py"&gt;.error_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;rollback_deployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;HighErrorRate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what I use in production for infrastructure agents: &lt;strong&gt;validate by outcome, not output.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Resource scheduling: beyond CPU and memory
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes resource model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1000m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler assigns pods to nodes based on available CPU and memory. Simple, measurable, predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent resource model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AgentResourceRequirements&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Traditional resources&lt;/span&gt;
    &lt;span class="n"&gt;cpu_cores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// LLM-specific resources&lt;/span&gt;
    &lt;span class="n"&gt;token_quota&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TokenQuota&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;input_tokens_per_minute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens_per_minute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// OpenAI, Anthropic, etc.&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="c1"&gt;// Model requirements&lt;/span&gt;
    &lt;span class="n"&gt;model_constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelConstraints&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;min_context_window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Need 100k tokens for large codebase&lt;/span&gt;
        &lt;span class="n"&gt;required_capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Capability&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// FunctionCalling, Vision, etc.&lt;/span&gt;
        &lt;span class="n"&gt;max_cost_per_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Budget constraint&lt;/span&gt;
        &lt;span class="n"&gt;max_latency_p95&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// SLA requirement&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="c1"&gt;// Tool access&lt;/span&gt;
    &lt;span class="n"&gt;required_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// GitHub, AWS, Database access&lt;/span&gt;

    &lt;span class="c1"&gt;// Quality requirements&lt;/span&gt;
    &lt;span class="n"&gt;min_quality_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Some tasks need high-quality models&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scheduling complexity:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Token quotas are rate-limited, not capacity-limited&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;K8s: Node has 8 CPU cores, can run 8 single-core pods&lt;/li&gt;
&lt;li&gt;Agents: Provider has 100k TPM (tokens per minute), but token usage varies wildly&lt;/li&gt;
&lt;li&gt;Task A: 500 tokens (simple question)&lt;/li&gt;
&lt;li&gt;Task B: 50k tokens (large codebase analysis)&lt;/li&gt;
&lt;li&gt;Can't predict how many concurrent tasks are feasible&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model availability changes dynamically&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;   &lt;span class="c1"&gt;// Morning: GPT-5 available, low latency&lt;/span&gt;
   &lt;span class="c1"&gt;// Afternoon: GPT-5 rate limited (org-wide spike)&lt;/span&gt;
   &lt;span class="c1"&gt;// Fallback to Claude? Different reasoning style might break downstream tasks&lt;/span&gt;

   &lt;span class="c1"&gt;// Evening: GPT-5 back but model updated (gpt-5-0613 → gpt-5-1106)&lt;/span&gt;
   &lt;span class="c1"&gt;// Output format slightly different, breaks parsing logic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization vs. quality tradeoff&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;   &lt;span class="c1"&gt;// K8s: Use cheapest instance type that meets CPU/memory needs&lt;/span&gt;
   &lt;span class="c1"&gt;// Agents: Complex multi-objective optimization&lt;/span&gt;

   &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;schedule_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SchedulingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;available_models&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;// Pareto frontier: cost vs. quality vs. latency&lt;/span&gt;
       &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
           &lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="nf"&gt;.can_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.requirements&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
               &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
               &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;predict_quality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// ML model trained on past tasks&lt;/span&gt;
               &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="py"&gt;.avg_latency&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;queue_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

               &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="p"&gt;})&lt;/span&gt;
           &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;// Which to pick?&lt;/span&gt;
       &lt;span class="c1"&gt;// - Cheapest might fail task (quality too low)&lt;/span&gt;
       &lt;span class="c1"&gt;// - Best might blow budget&lt;/span&gt;
       &lt;span class="c1"&gt;// - Fastest might not be available&lt;/span&gt;

       &lt;span class="nf"&gt;optimize_by_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
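&lt;p&gt;Point 1 is why a plain counter doesn't work: a provider's TPM limit is a sliding window, not a fixed capacity. A minimal sketch of that kind of tracker (illustrative, and simplified from the &lt;code&gt;QuotaManager&lt;/code&gt; in the Rust code; real usage should be reconciled after the call, since estimates can be wildly off):&lt;/p&gt;

```python
import time
from collections import deque

class TpmQuota:
    """Sliding-window tokens-per-minute tracker for one provider."""

    def __init__(self, tpm_limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        self.clock = clock
        self.events = deque()  # (timestamp, tokens)

    def _prune(self):
        # Drop token events that have aged out of the window.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def remaining(self) -> int:
        self._prune()
        return self.tpm_limit - sum(t for _, t in self.events)

    def try_reserve(self, estimated_tokens: int) -> bool:
        """Admit a task only if its estimate fits the current window."""
        if self.remaining() < estimated_tokens:
            return False
        self.events.append((self.clock(), estimated_tokens))
        return True
```

&lt;p&gt;With a 100k TPM limit, a 50k-token task and a 60k-token task can't run in the same minute even though each fits on its own; the admission decision depends on what else landed in the window.&lt;/p&gt;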



&lt;h3&gt;
  
  
  Our scheduler implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AgentScheduler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Track real-time model availability&lt;/span&gt;
    &lt;span class="n"&gt;model_health&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;RwLock&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HealthStatus&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Token quota tracking per provider&lt;/span&gt;
    &lt;span class="n"&gt;quota_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QuotaManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Historical task→model performance&lt;/span&gt;
    &lt;span class="n"&gt;performance_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PerformanceDB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Cost tracking&lt;/span&gt;
    &lt;span class="n"&gt;budget_tracker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BudgetTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;AgentScheduler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Assignment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScheduleError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Filter models by hard constraints&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;viable_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.get_viable_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;viable_models&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;ScheduleError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;NoViableModel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Check token quotas&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.quota_manager&lt;/span&gt;
            &lt;span class="nf"&gt;.get_available_quota&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;viable_models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Predict task resource needs based on similar past tasks&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.performance_db&lt;/span&gt;
            &lt;span class="nf"&gt;.predict_token_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. Score each model&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;viable_models&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.filter_map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;quota&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quota&lt;/span&gt;&lt;span class="py"&gt;.remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Insufficient quota&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;

                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cost_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="py"&gt;.cost_per_token&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;MAX_COST&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.performance_db&lt;/span&gt;
                    &lt;span class="nf"&gt;.get_quality_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.task_type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;latency_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="py"&gt;.avg_latency&lt;/span&gt;&lt;span class="nf"&gt;.as_secs_f32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;MAX_LATENCY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                &lt;span class="c1"&gt;// Weighted combination based on task priority&lt;/span&gt;
                &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.priority&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nn"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Cost&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cost_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nn"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Quality&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;latency_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nn"&gt;Priority&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Latency&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;latency_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;

                &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="nf"&gt;.sort_by&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="na"&gt;.1&lt;/span&gt;&lt;span class="nf"&gt;.partial_cmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="na"&gt;.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="c1"&gt;// 5. Reserve quota and assign&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="nf"&gt;.first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.ok_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;ScheduleError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;AllModelsExhausted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.quota_manager&lt;/span&gt;&lt;span class="nf"&gt;.reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;selected_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Assignment&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;selected_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="py"&gt;.cost_per_token&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;estimated_tokens&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)|&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key differences from K8s scheduler:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictive (token usage) vs. declarative (CPU request)&lt;/li&gt;
&lt;li&gt;Multi-objective optimization vs. bin packing&lt;/li&gt;
&lt;li&gt;Real-time quota consumption tracking&lt;/li&gt;
&lt;li&gt;Model health and version tracking&lt;/li&gt;
&lt;li&gt;Cost-awareness is a first-class concern&lt;/li&gt;
&lt;/ul&gt;
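&lt;p&gt;To make the weighting concrete, here's a minimal, self-contained sketch of the priority-weighted scoring used above. The constants &lt;code&gt;MAX_COST&lt;/code&gt; and &lt;code&gt;MAX_LATENCY&lt;/code&gt; are illustrative normalization bounds, not values from a real scheduler:&lt;/p&gt;

```rust
// Illustrative normalization bounds (assumptions, not real limits).
const MAX_COST: f32 = 0.001; // $/token
const MAX_LATENCY: f32 = 30.0; // seconds

enum Priority {
    Cost,
    Quality,
    Latency,
}

/// Weighted multi-objective score; each component is normalized to [0, 1].
fn score(priority: &Priority, cost_per_token: f32, quality: f32, latency_secs: f32) -> f32 {
    let cost_score = 1.0 - (cost_per_token / MAX_COST);
    let latency_score = 1.0 - (latency_secs / MAX_LATENCY);
    match priority {
        Priority::Cost => cost_score * 0.7 + quality * 0.3,
        Priority::Quality => quality * 0.7 + cost_score * 0.2 + latency_score * 0.1,
        Priority::Latency => latency_score * 0.6 + quality * 0.3 + cost_score * 0.1,
    }
}

fn main() {
    // A cheap, mid-quality model wins under Cost priority...
    assert!(score(&Priority::Cost, 0.0001, 0.6, 10.0) > score(&Priority::Cost, 0.0009, 0.95, 5.0));
    // ...but the premium model wins under Quality priority.
    assert!(score(&Priority::Quality, 0.0009, 0.95, 5.0) > score(&Priority::Quality, 0.0001, 0.6, 10.0));
    println!("ok");
}
```

&lt;p&gt;Note each weight set sums to 1.0, so scores stay comparable across priorities.&lt;/p&gt;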




&lt;h2&gt;
  
  
  4. State Management: The Distributed Reasoning Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes State Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐
│   Pod (stateless)  │
└─────────────────┘
         │
         ├─&amp;gt; ConfigMap (immutable config)
         ├─&amp;gt; Secret (credentials)
         └─&amp;gt; PersistentVolume (durable state)

# State is externalized, pod is disposable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Agent State Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────┐
│            Agent Task State                  │
├──────────────────────────────────────────┤
│ 1. LLM Context Window (ephemeral)          │
│    - Conversation history                   │
│    - Retrieved documents                    │
│    - Previous reasoning steps               │
│    - Max 200k tokens, then lost             │
├──────────────────────────────────────────┤
│ 2. Tool Execution Side Effects (durable)   │
│    - Files created in GitHub                │
│    - Database records modified              │
│    - Cloud resources provisioned            │
│    - Slack messages sent                    │
├──────────────────────────────────────────┤
│ 3. Reasoning State (semi-structured)       │
│    - Current subtask in decomposition       │
│    - Hypotheses being explored              │
│    - Confidence scores                      │
│    - Retry attempts                         │
├──────────────────────────────────────────┤
│ 4. Inter-Agent State (distributed)         │
│    - Results from sub-agents                │
│    - Merge conflicts                        │
│    - Dependency resolution status           │
└──────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; State is scattered across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM provider's context (can't access directly)&lt;/li&gt;
&lt;li&gt;External systems (GitHub, AWS, etc.)&lt;/li&gt;
&lt;li&gt;Your orchestrator's DB&lt;/li&gt;
&lt;li&gt;Other agents' context windows&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example: Mid-Task Failure Recovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Task: "Implement feature X across 3 microservices"&lt;/span&gt;

&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;decomposes&lt;/span&gt; &lt;span class="n"&gt;into&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;API&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GitHub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GitHub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GitHub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="n"&gt;library&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="err"&gt;✓&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;committed&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;GitHub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="err"&gt;✗&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TIMEOUT&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="n"&gt;crashed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="err"&gt;✗&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;not&lt;/span&gt; &lt;span class="n"&gt;started&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Deploy&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;
   &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Not&lt;/span&gt; &lt;span class="n"&gt;started&lt;/span&gt;

&lt;span class="c1"&gt;// Recovery questions:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Which commits were part of this task? (need git SHA tracking)&lt;/span&gt;
&lt;span class="c1"&gt;// 2. What was Service B agent trying to do when it crashed?&lt;/span&gt;
&lt;span class="c1"&gt;// 3. Can we resume Service B without re-reading entire codebase?&lt;/span&gt;
&lt;span class="c1"&gt;// 4. Do we rollback Service A changes? Or continue forward?&lt;/span&gt;
&lt;span class="c1"&gt;// 5. If we retry, how do we prevent Service B from duplicating Service A's work?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Our Solution: Task DAG with Explicit Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;TaskGraph&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskNode&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Dependency&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;TaskNode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;assigned_agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// CRITICAL: Capture side effects&lt;/span&gt;
    &lt;span class="n"&gt;side_effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SideEffect&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// CRITICAL: Capture reasoning&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReasoningStep&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Checkpoint for recovery&lt;/span&gt;
    &lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Checkpoint&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;SideEffect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GitCommit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;FileModified&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;APICall&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u16&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;DatabaseMutation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DbOperation&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;CloudResource&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Dependency&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dependency_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DependencyType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;DependencyType&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// B must start after A completes&lt;/span&gt;
    &lt;span class="nf"&gt;DataDependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// B needs output from A&lt;/span&gt;
    &lt;span class="n"&gt;ConflictingResources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// A and B can't run concurrently (e.g., both modify same file)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Recovery logic&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;TaskGraph&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;recover_from_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RecoveryPlan&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Find all completed upstream dependencies&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;completed_deps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.get_completed_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Check if any side effects need rollback&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;downstream_affected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.get_downstream_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Determine resume strategy&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="py"&gt;.checkpoint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Can resume from checkpoint&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;RecoveryPlan&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Resume&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;from_checkpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;replay_side_effects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Already done&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.can_rollback_side_effects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Clean rollback possible&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;RecoveryPlan&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Rollback&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;tasks_to_rollback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;side_effects_to_revert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.get_side_effects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Partial state exists, need human decision&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nn"&gt;RecoveryPlan&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;RequiresHuman&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Side effects cannot be automatically rolled back"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;affected_systems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.get_affected_systems&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed_task&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Kubernetes can restart pods because containers don't mutate external state (ideally). Agents inherently mutate state across multiple systems, so recovery requires explicitly tracking and potentially reverting side effects.&lt;/p&gt;
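&lt;p&gt;A sketch of the side-effect compensation idea: map each recorded effect to a compensating action, and escalate when no clean inverse exists. The variant and action names here are hypothetical illustrations, trimmed from the fuller &lt;code&gt;SideEffect&lt;/code&gt; enum above, not a real API:&lt;/p&gt;

```rust
#[derive(Debug, PartialEq)]
enum SideEffect {
    // Hypothetical variants for illustration.
    GitCommit { repo: String, sha: String },
    CloudResource { provider: String, resource_id: String },
    SlackMessage { channel: String },
}

#[derive(Debug, PartialEq)]
enum Compensation {
    Revert(String),      // e.g. git revert of the recorded SHA
    Deprovision(String), // delete the recorded cloud resource
    HumanReview(String), // no clean inverse; a human decides
}

fn compensate(effect: &SideEffect) -> Compensation {
    match effect {
        SideEffect::GitCommit { repo, sha } => Compensation::Revert(format!("{repo}@{sha}")),
        SideEffect::CloudResource { resource_id, .. } => {
            Compensation::Deprovision(resource_id.clone())
        }
        // A sent message is already visible: nothing to roll back, only to escalate.
        SideEffect::SlackMessage { channel } => {
            Compensation::HumanReview(format!("message already posted in {channel}"))
        }
    }
}

fn main() {
    let commit = SideEffect::GitCommit { repo: "org/svc-a".into(), sha: "abc123".into() };
    assert_eq!(compensate(&commit), Compensation::Revert("org/svc-a@abc123".into()));

    let msg = SideEffect::SlackMessage { channel: "#deploys".into() };
    assert!(matches!(compensate(&msg), Compensation::HumanReview(_)));
    println!("ok");
}
```

&lt;p&gt;This is why the recovery logic above has a &lt;code&gt;RequiresHuman&lt;/code&gt; branch: some effects simply have no automatic rollback.&lt;/p&gt;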




&lt;h2&gt;
  
  
  5. Observability: Debugging Reasoning, Not Just Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes Observability
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Standard signals&lt;/span&gt;
kubectl logs pod-name  &lt;span class="c"&gt;# STDOUT/STDERR&lt;/span&gt;
kubectl top pod        &lt;span class="c"&gt;# CPU/Memory usage&lt;/span&gt;
kubectl describe pod   &lt;span class="c"&gt;# Events, status&lt;/span&gt;

&lt;span class="c"&gt;# Metrics (RED method)&lt;/span&gt;
- Rate: Requests per second
- Errors: Error rate
- Duration: Latency percentiles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Debugging: Look at logs, find error, fix code, redeploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Observability Requirements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Traditional metrics&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TokenUsage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Reasoning trace - CRITICAL FOR DEBUGGING&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReasoningStep&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Tool interaction&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ToolCall&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Quality metrics&lt;/span&gt;
    &lt;span class="n"&gt;output_quality_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ValidationResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Context&lt;/span&gt;
    &lt;span class="n"&gt;input_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Detect if same input produces different output&lt;/span&gt;
    &lt;span class="n"&gt;parent_trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TraceId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Link to spawning agent&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize,&lt;/span&gt; &lt;span class="nd"&gt;Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ReasoningStep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;step_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;thought&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// What agent was thinking&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentAction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// What it decided to do&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// What happened after action&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// How sure the agent was (if using Chain-of-Thought)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example debugging scenario:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bug report: "Agent deployed wrong configuration to prod"

# Traditional debugging:
kubectl logs agent-pod-xyz
# Shows: "Deployment successful"
# Useless - I know it deployed, I need to know WHY it chose that config

# Agent debugging:
SELECT reasoning_trace FROM agent_traces WHERE task_id = 'xyz';

# Returns:
{
  "step_1": {
    "thought": "User requested deployment to production",
    "action": "fetch_current_config(environment='prod')",
    "observation": "Current config uses instance type m5.large"
  },
  "step_2": {
    "thought": "To optimize cost, I'll downgrade to t3.medium",
    "action": "generate_terraform_config(instance_type='t3.medium')",
    "confidence": 0.85,
    "observation": "Generated new config"
  },
  "step_3": {
    "thought": "Config looks good, applying to prod",
    "action": "terraform_apply(environment='prod')",
    "observation": "Applied successfully"
  }
}

# Found the bug: Agent decided to "optimize cost" autonomously
# Solution: Add constraint that any cost-saving change needs approval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
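&lt;p&gt;Getting that queryable trace means recording each step as structured data at execution time, not parsing it out of logs afterwards. A minimal Python sketch of the capture side (class and method names here are illustrative, not the production schema):&lt;/p&gt;

```python
# Sketch of structured reasoning capture. Each step records the
# thought/action/observation triple so a later query can answer
# "why did the agent choose this?" All names are illustrative.
from dataclasses import dataclass, field, asdict


@dataclass
class ReasoningStep:
    step_number: int
    thought: str       # what the agent was thinking
    action: str        # what it decided to do
    observation: str   # what happened after the action


@dataclass
class AgentTrace:
    task_id: str
    steps: list = field(default_factory=list)

    def record(self, thought: str, action: str, observation: str = "") -> None:
        # Steps are numbered in the order they were recorded
        self.steps.append(
            ReasoningStep(len(self.steps) + 1, thought, action, observation)
        )

    def to_queryable(self) -> dict:
        # Same shape as the {"step_1": {...}, "step_2": {...}} result above
        return {f"step_{s.step_number}": asdict(s) for s in self.steps}
```

Stored as JSON per task, a record like this is what makes a reasoning-trace query of the kind shown above possible at all.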



&lt;p&gt;&lt;strong&gt;Observability I built:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ObservabilityPipeline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Distributed tracing (similar to OpenTelemetry)&lt;/span&gt;
    &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentTracer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Metrics&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetricsCollector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Reasoning storage&lt;/span&gt;
    &lt;span class="n"&gt;reasoning_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ReasoningDatabase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;// Quality evaluation (async)&lt;/span&gt;
    &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QualityEvaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;ObservabilityPipeline&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;trace_agent_execution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.tracer&lt;/span&gt;&lt;span class="nf"&gt;.start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"agent_execution"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Wrap LLM calls to capture reasoning&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;traced_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TracedLLM&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="py"&gt;.llm&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="c1"&gt;// Execute task with traced LLM&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="nf"&gt;.execute_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traced_llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract reasoning from LLM responses&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;reasoning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.extract_reasoning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Calculate metrics&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.token_usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Async quality evaluation (don't block response)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;eval_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.evaluator&lt;/span&gt;&lt;span class="nf"&gt;.evaluate_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="nf"&gt;.trace_id&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="nf"&gt;.duration_ms&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;reasoning_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;validation_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="py"&gt;.validation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="c1"&gt;// Store for later analysis&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.reasoning_db&lt;/span&gt;&lt;span class="nf"&gt;.store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Query interface for debugging&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;debug_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DebugReport&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.reasoning_db&lt;/span&gt;&lt;span class="nf"&gt;.get_traces_for_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="n"&gt;DebugReport&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="py"&gt;.cost_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="py"&gt;.tokens_used&lt;/span&gt;&lt;span class="nf"&gt;.total&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.sum&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;reasoning_tree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.build_reasoning_tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.flat_map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="py"&gt;.tool_calls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;quality_scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.filter_map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="py"&gt;.quality_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;failure_points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.filter&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="py"&gt;.validation_result&lt;/span&gt;&lt;span class="nf"&gt;.is_fail&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dashboard I actually use in production:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Task: "Refactor auth module"
├─ Total Cost: $2.34
├─ Total Tokens: 87,432
├─ Duration: 4m 23s
├─ Quality Score: 8.2/10
├─ Agent Decisions:
│  ├─ Step 1: Analyzed codebase (Claude Sonnet, $0.45, 32k tokens)
│  │  └─ Reasoning: "Identified 3 performance bottlenecks"
│  ├─ Step 2: Generated refactor plan (GPT-5, $1.20, 45k tokens)
│  │  └─ Reasoning: "Will parallelize token validation and cache user roles"
│  ├─ Step 3: Wrote tests (Claude Haiku, $0.12, 8k tokens)
│  │  └─ Reasoning: "Using cheaper model for straightforward task"
│  └─ Step 4: Applied changes (Claude Sonnet, $0.57, 12k tokens)
│     └─ Reasoning: "Need context awareness for merge conflicts"
└─ Validation: Tests passed, performance improved 3x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observability that agents critically need and Kubernetes doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why did the agent make this decision?&lt;/strong&gt; (reasoning trace)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What would it have done differently with a different model?&lt;/strong&gt; (A/B testing agents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the output quality degrading over time?&lt;/strong&gt; (model drift detection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which subtasks were expensive vs. valuable?&lt;/strong&gt; (ROI per reasoning step)&lt;/li&gt;
&lt;/ul&gt;
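&lt;p&gt;The drift question in particular has a simple first-pass answer: compare a recent window of quality scores against a longer baseline window and alert when the gap exceeds a threshold. A sketch, where window sizes and the threshold are illustrative values rather than tuned ones:&lt;/p&gt;

```python
# Sketch of model drift detection over per-task quality scores.
# Window sizes and the alert threshold are illustrative, not tuned values.
from statistics import mean


def detect_drift(scores, baseline_n=50, recent_n=20, max_drop=0.5):
    """Return True if the mean of the most recent `recent_n` scores has
    dropped more than `max_drop` below the mean of the first `baseline_n`."""
    if len(scores) < baseline_n + recent_n:
        return False  # not enough history to compare yet
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    return baseline - recent > max_drop
```

In practice you would segment this per model and per task type; a regression in one model's output can hide behind another model's stability in an aggregate series.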




&lt;h2&gt;
  
  
  6. Dynamic Task Decomposition: The Recursive Scheduling Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes: Static Workload Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You declare the full workload upfront&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Known at deploy time&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker:v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler knows exactly how many pods to create.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent: Runtime Task Decomposition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent doesn't know how many subtasks until it analyzes the problem
&lt;/span&gt;
&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Migrate our monolith to microservices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Agent reasoning:
# 1. First, analyze codebase to identify service boundaries
&lt;/span&gt;&lt;span class="n"&gt;initial_analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codebase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Based on analysis, dynamically spawn decomposition
# Agent decides: "I found 7 service boundaries"
&lt;/span&gt;&lt;span class="n"&gt;subtasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract user service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract auth service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract payment service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 4 more services
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update API gateway routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update deployment pipelines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Some subtasks spawn further subtasks at runtime
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sub_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;spawn_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sub_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_further_breakdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Recursive decomposition - didn't know this upfront
&lt;/span&gt;        &lt;span class="n"&gt;even_more_subtasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sub_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decompose_further&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sub_sub_task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;even_more_subtasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;spawn_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub_sub_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Orchestration challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resource reservation is impossible&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes: Reserve N pods worth of CPU/memory upfront&lt;/li&gt;
&lt;li&gt;Agents: Don't know how many sub-agents until runtime&lt;/li&gt;
&lt;li&gt;Can't predict total cost or token usage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circular dependencies detected at runtime&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Task: "Implement feature X"
   ├─ Agent A: "Need to update schema"
   │  └─ Spawns Agent B: "Design new schema"
   │     └─ Agent B: "Need to know data access patterns"
   │        └─ Spawns Agent C: "Analyze current queries"
   │           └─ Agent C: "Need to understand schema"  ← CIRCULAR!
   │              └─ Tries to spawn Agent B again...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
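&lt;p&gt;This loop is detectable before it runs forever: treat every "agent X wants to spawn task Y" as an edge in the task graph, and refuse any spawn whose target already appears in the spawner's ancestry. A Python sketch of that check (function names are illustrative):&lt;/p&gt;

```python
# Sketch: reject subtask spawns that would close a cycle in the task graph.
# `parents` maps each task id to the task that spawned it.


def would_create_cycle(parents, spawner, new_task):
    """Walk up the ancestry chain from `spawner`; if `new_task` already
    appears in it, spawning it again would close a loop."""
    current = spawner
    while current is not None:
        if current == new_task:
            return True
        current = parents.get(current)
    return False


def spawn(parents, spawner, new_task):
    if would_create_cycle(parents, spawner, new_task):
        raise RuntimeError(f"circular decomposition: {spawner} -> {new_task}")
    parents[new_task] = spawner
    return new_task
```

The same ancestry walk also gives you a depth check almost for free: count the hops instead of comparing task ids.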



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Conflicting subtask outputs&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Task: "Optimize database performance"
   ├─ Agent A: "Add index on user_id"
   ├─ Agent B: "Denormalize user table"  ← Conflicts with A's approach
   └─ Agent C: "Move to NoSQL"  ← Conflicts with both A and B

   # All three agents work in parallel, unaware of each other's decisions
   # A merge agent must reconcile the contradictory approaches
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Our Solution: Hierarchical Task Graphs with Constraints
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;HierarchicalTaskGraph&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskNode&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Constraint&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Clone)]&lt;/span&gt;
&lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;Constraint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Resource constraints&lt;/span&gt;
    &lt;span class="nf"&gt;MaxConcurrentTasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;MaxTotalCost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;MaxTreeDepth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Prevent infinite recursion&lt;/span&gt;

    &lt;span class="c1"&gt;// Logical constraints&lt;/span&gt;
    &lt;span class="nf"&gt;MutualExclusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Can't run simultaneously&lt;/span&gt;
    &lt;span class="nf"&gt;RequiredSequence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Must run in order&lt;/span&gt;
    &lt;span class="nf"&gt;DeduplicationKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Prevent duplicate tasks&lt;/span&gt;

    &lt;span class="c1"&gt;// Quality constraints&lt;/span&gt;
    &lt;span class="nf"&gt;RequireHumanApproval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Predicate&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// High-risk tasks need approval&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;TaskNode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Track recursion depth&lt;/span&gt;

    &lt;span class="c1"&gt;// Runtime decomposition tracking&lt;/span&gt;
    &lt;span class="n"&gt;decomposed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;decomposition_reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;HierarchicalTaskGraph&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Agent requests to spawn subtask&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;request_subtask_spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;subtask_desc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SpawnError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Check depth constraint (prevent infinite recursion)&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;parent_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="py"&gt;.depth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent_depth&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.constraints&lt;/span&gt;&lt;span class="nf"&gt;.max_depth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SpawnError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MaxDepthExceeded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Deduplication check&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;dedup_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.compute_task_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subtask_desc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.task_exists_with_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dedup_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SpawnError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DuplicateTask&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Check resource constraints&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.count_running_tasks&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.constraints&lt;/span&gt;&lt;span class="nf"&gt;.max_concurrent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SpawnError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ResourceExhausted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.estimate_subtask_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subtask_desc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.total_cost_so_far&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.constraints&lt;/span&gt;&lt;span class="nf"&gt;.max_cost&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SpawnError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;BudgetExceeded&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. Check for circular dependencies&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.would_create_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;subtask_desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SpawnError&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CircularDependency&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// 5. Create subtask&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TaskNode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parent_depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subtask_desc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// 6. Check approval constraints&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.requires_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.mark_awaiting_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Detect circular dependencies&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;would_create_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_task_desc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Check if new task's description matches any ancestor&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

            &lt;span class="c1"&gt;// Semantic similarity check (not just exact match)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.task_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_task_desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Likely circular&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="py"&gt;.parent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
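
&lt;p&gt;The &lt;code&gt;task_similarity&lt;/code&gt; call above is left abstract. As a rough illustration (my own stand-in, not the actual implementation), a token-overlap (Jaccard) score is enough to show the shape of the 0.85 gate; a production system would more likely use embeddings:&lt;/p&gt;

```rust
use std::collections::HashSet;

// Hypothetical stand-in for `task_similarity`: a cheap token-overlap
// (Jaccard) score over normalized words. Near-duplicate task descriptions
// score close to 1.0 and would trip the 0.85 circular-dependency gate;
// unrelated tasks score near 0.0.
fn task_similarity(a: &str, b: &str) -> f64 {
    let tokens = |s: &str| -> HashSet<String> {
        s.to_lowercase()
            .split_whitespace()
            // Strip trailing/leading punctuation so "pooling." == "pooling"
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
            .filter(|t| !t.is_empty())
            .collect()
    };
    let (ta, tb) = (tokens(a), tokens(b));
    if ta.is_empty() || tb.is_empty() {
        return 0.0;
    }
    let intersection = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    intersection / union
}

fn main() {
    // Same task modulo case and punctuation: above the 0.85 threshold.
    let dup = task_similarity("Implement connection pooling.", "implement connection pooling");
    // Unrelated tasks: well below it.
    let distinct = task_similarity("implement connection pooling", "write parser for target site");
    println!("{dup:.2} {distinct:.2}");
    assert!(dup > 0.85 && distinct < 0.1);
}
```

&lt;p&gt;Token overlap misses paraphrases ("build HTTP client" vs "implement a web request layer"); embeddings catch those, but the ancestor-walk and threshold check stay the same either way.&lt;/p&gt;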



&lt;p&gt;&lt;strong&gt;Example: Preventing Runaway Task Explosion&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Without constraints:&lt;/span&gt;
&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Build a web scraper"&lt;/span&gt;
&lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"First, build HTTP client"&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Implement connection pooling"&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Optimize socket management"&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;        &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Implement custom TCP stack"&lt;/span&gt;  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;WAY&lt;/span&gt; &lt;span class="n"&gt;too&lt;/span&gt; &lt;span class="n"&gt;deep&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;           &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Write network driver"&lt;/span&gt;  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;INSANE&lt;/span&gt;

&lt;span class="c1"&gt;// With constraints:&lt;/span&gt;
&lt;span class="n"&gt;constraints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nf"&gt;MaxTreeDepth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Max 3 levels of decomposition&lt;/span&gt;
    &lt;span class="nf"&gt;RequireHumanApproval&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="py"&gt;.estimated_cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;MaxTotalCost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// Now:&lt;/span&gt;
&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Build a web scraper"&lt;/span&gt;
&lt;span class="err"&gt;├─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"First, build HTTP client"&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Implement connection pooling"&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Actually, just use reqwest crate"&lt;/span&gt;  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;Depth&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chose&lt;/span&gt; &lt;span class="n"&gt;pragmatic&lt;/span&gt; &lt;span class="n"&gt;solution&lt;/span&gt;
&lt;span class="err"&gt;└─&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Write parser for target site"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. The Merge Problem: Reconciling Parallel Agent Outputs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes: No Merge Problem
&lt;/h3&gt;

&lt;p&gt;Pods don't modify each other's state. Service A and Service B are independent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agents: Constant Merge Conflicts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Task: "Refactor codebase for performance"&lt;/span&gt;
&lt;span class="c1"&gt;// Decomposed into 3 parallel agents:&lt;/span&gt;

&lt;span class="n"&gt;Agent&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;// auth.rs
- fn validate_token(token: &amp;amp;str) -&amp;gt; bool {
-     expensive_crypto_check(token)
- }
+ fn validate_token(token: &amp;amp;str) -&amp;gt; bool {
+     CACHE.get_or_insert(token, || expensive_crypto_check(token))
+ }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Agent B output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;br&gt;
diff&lt;br&gt;
// auth.rs  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fn validate_token(token: &amp;amp;str) -&amp;gt; bool {&lt;/li&gt;
&lt;li&gt;    expensive_crypto_check(token)&lt;/li&gt;
&lt;li&gt;}&lt;/li&gt;
&lt;li&gt;async fn validate_token(token: &amp;amp;str) -&amp;gt; Result {&lt;/li&gt;
&lt;li&gt;    expensive_crypto_check_async(token).await&lt;/li&gt;
&lt;li&gt;}  // Made it async for better concurrency
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Agent C output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;br&gt;
diff&lt;br&gt;
// auth.rs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fn validate_token(token: &amp;amp;str) -&amp;gt; bool {&lt;/li&gt;
&lt;li&gt;    expensive_crypto_check(token)
&lt;/li&gt;
&lt;li&gt;}&lt;/li&gt;
&lt;li&gt;fn validate_token(token: &amp;amp;str) -&amp;gt; bool {&lt;/li&gt;
&lt;li&gt;    // Refactored crypto library&lt;/li&gt;
&lt;li&gt;    new_fast_crypto::check(token)&lt;/li&gt;
&lt;li&gt;}
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**All three modified the same function with incompatible changes.**

**Merge strategies I've tried:**

**1. Sequential execution (kills parallelism)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Simple but slow
let result_a = agent_a.execute().await;
let result_b = agent_b.execute_with_context(result_a).await;
let result_c = agent_c.execute_with_context(result_b).await;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**2. LLM-based merge agent (works surprisingly well)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;struct MergeAgent {
    llm: LLM,
}

impl MergeAgent {
    async fn merge_outputs(
        &amp;amp;self,
        original: &amp;amp;str,
        outputs: Vec,
    ) -&amp;gt; Result {
        let prompt = format!(
            r#"
            Original code:

            ```
            {original}
            ```

            Three agents proposed these changes:

            Agent A (caching optimization):

            ```diff
            {diff_a}
            ```

            Agent B (async conversion):

            ```diff
            {diff_b}
            ```

            Agent C (library upgrade):

            ```diff
            {diff_c}
            ```

            These changes conflict. Merge them into a single implementation that:
            1. Preserves all optimizations where possible
            2. Maintains semantic correctness
            3. Resolves conflicts by choosing the best approach

            Explain your reasoning for conflict resolution.
            "#,
            original = original,
            diff_a = outputs[0].diff,
            diff_b = outputs[1].diff,
            diff_c = outputs[2].diff,
        );

        let response = self.llm.generate(prompt).await?;

        // Extract merged code and reasoning
        let merged = self.extract_code(&amp;amp;response)?;
        let reasoning = self.extract_reasoning(&amp;amp;response)?;

        // Validate merge
        if !self.validate_merge(&amp;amp;merged).await? {
            return Err(MergeError::InvalidMerge);
        }

        Ok(merged)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**3. Conflict detection + human escalation (production approach)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;impl MergeOrchestrator {
    async fn merge_with_conflict_detection(
        &amp;amp;self,
        outputs: Vec,
    ) -&amp;gt; MergeResult {
        // 1. Detect conflicts
        let conflicts = self.detect_conflicts(&amp;amp;outputs).await;

        if conflicts.is_empty() {
            // No conflicts - simple merge
            return self.simple_merge(outputs).await;
        }

        // 2. Categorize conflicts
        let categorized = self.categorize_conflicts(conflicts);

        for conflict in categorized {
            match conflict.severity {
                Severity::Trivial =&amp;gt; {
                    // E.g., formatting differences - auto-resolve
                    self.auto_resolve(conflict).await;
                }
                Severity::Semantic =&amp;gt; {
                    // E.g., different algorithms - try LLM merge
                    match self.llm_merge(conflict).await {
                        Ok(merged) =&amp;gt; continue,
                        Err(_) =&amp;gt; {
                            // LLM couldn't resolve - escalate
                            self.request_human_resolution(conflict).await;
                        }
                    }
                }
                Severity::Critical =&amp;gt; {
                    // E.g., contradictory business logic - always escalate
                    self.request_human_resolution(conflict).await;
                }
            }
        }

        MergeResult::PartialMerge {
            merged: self.get_merged_outputs(),
            pending_conflicts: self.get_pending_conflicts(),
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**Real production example:**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Task: "Optimize our API gateway"&lt;/p&gt;

&lt;p&gt;Agent A: "Removed rate limiting (it's causing latency)"&lt;br&gt;
Agent B: "Tightened rate limiting (we're getting DoS attacks)"&lt;br&gt;
Agent C: "Moved rate limiting to edge CDN (best of both worlds)"&lt;/p&gt;

&lt;p&gt;Merge conflict: A and B are contradictory&lt;br&gt;
Resolution: Human reviewed, chose C's approach&lt;br&gt;
Lesson: Some conflicts need domain expertise to resolve&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


---

## Conclusion: why this matters for Infrastructure engineers

If you're building agent systems in production, you can't treat them like stateless services. The orchestration challenges are fundamentally different:

1. **Non-determinism requires semantic checkpointing**, not just process restart
2. **Failure detection needs outcome validation**, not just exit codes  
3. **Resource scheduling is multi-objective optimization**, not bin packing
4. **State is distributed across LLM context and external systems**, requiring explicit side-effect tracking
5. **Observability must capture reasoning**, not just execution metrics
6. **Task decomposition is recursive and dynamic**, requiring depth limits and deduplication
7. **Parallel agent outputs require intelligent merging**, not just process isolation

I built AgentFlow to address these challenges with:
- **Hierarchical task graphs** with runtime constraints
- **Semantic checkpointing** for failure recovery
- **Multi-model scheduling** with cost/quality/latency tradeoffs
- **Reasoning traces** for debugging agent decisions
- **Conflict detection and resolution** for parallel agent outputs

The paradigm shift: **Kubernetes orchestrates processes. Agent orchestrators orchestrate reasoning.**

---
If you're building production agent systems and hitting these problems, I'd love to hear how you're solving them. Find me on [Twitter](https://x.com/Siddhant_K_code) or [GitHub](https://github.com/Siddhant-K-code).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>kubernetes</category>
      <category>agents</category>
    </item>
    <item>
      <title>Serverless economics: why Cloud Run crushes App Runner (until it doesn’t)</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Mon, 20 Oct 2025 04:10:35 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/serverless-economics-why-cloud-run-crushes-app-runner-until-it-doesnt-5fh3</link>
      <guid>https://dev.to/siddhantkcode/serverless-economics-why-cloud-run-crushes-app-runner-until-it-doesnt-5fh3</guid>
      <description>&lt;p&gt;&lt;em&gt;This analysis is based on official pricing documentation and straightforward cost calculations.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pricing&lt;/strong&gt;: Cloud Run is dramatically cheaper for short-running workloads (up to 17x cost difference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Integration&lt;/strong&gt;: App Runner provides native ecosystem integration worth considering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt;: Cloud Run offers true scale-to-zero; App Runner keeps memory always-on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Break-even point&lt;/strong&gt;: ~20 hours/day runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Price Differential Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;When evaluating serverless container platforms, most discussions focus on features. Let's focus on what actually matters: &lt;strong&gt;cost and architectural trade-offs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Running 1 vCPU + 2GB memory in the Asia region:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Daily Runtime&lt;/th&gt;
&lt;th&gt;Cloud Run&lt;/th&gt;
&lt;th&gt;App Runner&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;$1.04&lt;/td&gt;
&lt;td&gt;$17.82&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 hours&lt;/td&gt;
&lt;td&gt;$7.31&lt;/td&gt;
&lt;td&gt;$22.68&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 hours&lt;/td&gt;
&lt;td&gt;$24.62&lt;/td&gt;
&lt;td&gt;$32.40&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 hours&lt;/td&gt;
&lt;td&gt;$39.93&lt;/td&gt;
&lt;td&gt;$42.12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.05x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;td&gt;$85.07&lt;/td&gt;
&lt;td&gt;$71.28&lt;/td&gt;
&lt;td&gt;App Runner wins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost reversal happens around 20 hours/day of continuous operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Differences That Drive Pricing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cloud Run: True Serverless
&lt;/h3&gt;

&lt;p&gt;Built on Knative with request-driven scaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost Model:
- vCPU: $0.000024/vCPU-second
- Memory: $0.0000025/GiB-second
- Billing granularity: per-second
- Free tier: 180,000 vCPU-sec, 360,000 GiB-sec monthly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scale-to-zero&lt;/strong&gt;: When idle, you pay nothing. This is the key differentiator.&lt;/p&gt;

&lt;h3&gt;
  
  
  App Runner: Hybrid Provisioning
&lt;/h3&gt;

&lt;p&gt;Memory-always-on + CPU-on-demand model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost Model:
- Provisioned (Memory): $0.009/GB-hour (always charged)
- Active (CPU): $0.081/vCPU-hour (only during processing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Always-on memory&lt;/strong&gt;: Base cost of $12.96/month for 2GB, regardless of usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cloud Run: 2 Hours Daily Operation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly runtime: 2h × 30d = 60h = 216,000 seconds

Resource consumption:
- vCPU: 1 vCPU × 216,000s = 216,000 vCPU-sec
- Memory: 2 GiB × 216,000s = 432,000 GiB-sec

After free tier:
- vCPU billable: 216,000 - 180,000 = 36,000 vCPU-sec
- Memory billable: 432,000 - 360,000 = 72,000 GiB-sec

Charges:
- vCPU: 36,000 × $0.000024 = $0.864
- Memory: 72,000 × $0.0000025 = $0.180
Total: $1.044
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  App Runner: 2 Hours Daily Operation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fixed memory cost (always-on):
2GB × $0.009/GB-hour × 24h × 30d = $12.96

CPU cost (usage-based):
Monthly CPU runtime: 2h × 30d = 60h
1 vCPU × 60h × $0.081/vCPU-hour = $4.86

Total: $12.96 + $4.86 = $17.82
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The $12.96 fixed cost is the critical factor. You're paying for memory reservation whether you use it or not.&lt;/p&gt;
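
&lt;p&gt;The arithmetic above folds into a small cost model. This is a sketch using the headline per-unit rates quoted earlier; the tables use Asia-region pricing, which differs slightly for some rows, so treat the output as the shape of the curve rather than invoice-exact numbers:&lt;/p&gt;

```rust
// Per-unit rates as quoted in the cost-model sections above.
const CR_VCPU_PER_SEC: f64 = 0.000024;
const CR_GIB_PER_SEC: f64 = 0.0000025;
const CR_FREE_VCPU_SEC: f64 = 180_000.0; // monthly free tier
const CR_FREE_GIB_SEC: f64 = 360_000.0;

const AR_MEM_PER_GB_HOUR: f64 = 0.009; // provisioned: always charged
const AR_CPU_PER_VCPU_HOUR: f64 = 0.081; // active: only while processing

// Monthly Cloud Run cost for 1 vCPU / 2 GiB running `hours_per_day`.
// Scale-to-zero means idle time costs nothing.
fn cloud_run_monthly(hours_per_day: f64) -> f64 {
    let secs = hours_per_day * 30.0 * 3600.0;
    let vcpu = (secs - CR_FREE_VCPU_SEC).max(0.0) * CR_VCPU_PER_SEC;
    let mem = (2.0 * secs - CR_FREE_GIB_SEC).max(0.0) * CR_GIB_PER_SEC;
    vcpu + mem
}

// Monthly App Runner cost: 2 GB memory billed 24/7, CPU only while active.
fn app_runner_monthly(hours_per_day: f64) -> f64 {
    let mem = 2.0 * AR_MEM_PER_GB_HOUR * 24.0 * 30.0; // fixed $12.96 floor
    let cpu = AR_CPU_PER_VCPU_HOUR * hours_per_day * 30.0;
    mem + cpu
}

fn main() {
    // Reproduces the 2-hour worked example: ~$1.04 vs $17.82.
    println!(
        "2h/day: Cloud Run ${:.2} vs App Runner ${:.2}",
        cloud_run_monthly(2.0),
        app_runner_monthly(2.0)
    );
}
```

&lt;p&gt;Sweeping &lt;code&gt;hours_per_day&lt;/code&gt; over your own workload profile (and your region's actual rates) is the quickest way to find where your break-even sits.&lt;/p&gt;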

&lt;h2&gt;
  
  
  When app runner makes sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AWS ecosystem lock-in strategy
&lt;/h3&gt;

&lt;p&gt;If you're running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS databases&lt;/li&gt;
&lt;li&gt;ElastiCache clusters
&lt;/li&gt;
&lt;li&gt;S3 storage&lt;/li&gt;
&lt;li&gt;VPC-internal services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-app:latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://rds-endpoint&lt;/span&gt;
      &lt;span class="na"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-bucket&lt;/span&gt;
    &lt;span class="na"&gt;instance_role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::account:role/AppRunnerRole&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IAM-based authentication eliminates credential management overhead. VPC connector provides direct private subnet access.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. VPC Integration requirements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App Runner → VPC Connector → Private Subnet → RDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cloud Run → Internet → Cloud SQL Proxy → Cloud SQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;App Runner's VPC integration is simpler for private resource access.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Predictable performance requirements
&lt;/h3&gt;

&lt;p&gt;No cold starts. Memory is always provisioned. Response time predictability matters for SLA-driven services.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. 24-Hour operation
&lt;/h3&gt;

&lt;p&gt;At 24/7 runtime, App Runner is actually cheaper ($71.28/month vs $85.07/month).&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Operational consolidation
&lt;/h3&gt;

&lt;p&gt;Single cloud provider strategy reduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cloud operational overhead&lt;/li&gt;
&lt;li&gt;Cross-cloud networking complexity&lt;/li&gt;
&lt;li&gt;Team training requirements&lt;/li&gt;
&lt;li&gt;Security policy fragmentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Simplified auto-scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# App Runner&lt;/span&gt;
&lt;span class="na"&gt;auto_scaling_configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_concurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;max_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;min_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cloud Run (more verbose)&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/maxScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;
        &lt;span class="na"&gt;autoscaling.knative.dev/minScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
        &lt;span class="na"&gt;run.googleapis.com/cpu-throttling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When Cloud Run is the Clear Winner
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cost-Constrained Projects
&lt;/h3&gt;

&lt;p&gt;Early-stage startups, MVPs, personal projects. $1.04/month vs $17.82/month is a 17x difference that compounds across multiple services.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Irregular/Low-Frequency Traffic
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Batch jobs (few daily executions)&lt;/li&gt;
&lt;li&gt;Webhook endpoints&lt;/li&gt;
&lt;li&gt;Development/staging environments&lt;/li&gt;
&lt;li&gt;Demo applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;True scale-to-zero means zero cost during idle periods.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Google Cloud Native Integration
&lt;/h3&gt;

&lt;p&gt;Native integration with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firebase&lt;/li&gt;
&lt;li&gt;BigQuery&lt;/li&gt;
&lt;li&gt;Google Workspace APIs&lt;/li&gt;
&lt;li&gt;Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Geographic distribution
&lt;/h3&gt;

&lt;p&gt;Cloud Run supports more regions for global deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold start reality check
&lt;/h2&gt;

&lt;p&gt;Cold start times vary significantly by runtime and application design.&lt;/p&gt;

&lt;p&gt;Based on typical container startup patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lightweight Node.js: 500ms-1s&lt;/li&gt;
&lt;li&gt;Heavy JVM applications: 3-5s&lt;/li&gt;
&lt;li&gt;Minimum instance configuration can mitigate this
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;App Runner's always-on memory eliminates cold starts entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration path considerations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cloud Run → GKE&lt;/strong&gt;: Relatively straightforward due to Knative foundation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;App Runner → EKS&lt;/strong&gt;: Requires more significant architectural changes.&lt;/p&gt;

&lt;p&gt;If Kubernetes migration is in your roadmap, Cloud Run provides a smoother path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision framework
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Choose Cloud Run if:
- Runtime &amp;lt; 12 hours/day
- Cost is primary constraint
- Traffic is sporadic/unpredictable
- Google Cloud ecosystem alignment

Choose App Runner if:
- Runtime &amp;gt; 20 hours/day
- AWS ecosystem consolidation
- VPC integration critical
- Cold start sensitivity
- Predictable performance required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The uncomfortable truth
&lt;/h2&gt;

&lt;p&gt;Cloud Run's pricing advantage is undeniable for short-running workloads. The 17x cost difference at low utilization is architectural, not operational.&lt;/p&gt;

&lt;p&gt;However, infrastructure decisions require considering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-term operational complexity costs&lt;/li&gt;
&lt;li&gt;Team expertise and training overhead&lt;/li&gt;
&lt;li&gt;Integration friction across cloud boundaries&lt;/li&gt;
&lt;li&gt;Compliance and security policy alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AWS-committed organizations, paying 3x more might be strategically rational when factoring in operational efficiency.&lt;/p&gt;

&lt;p&gt;For new projects with flexible infrastructure choices, Cloud Run's economics are compelling.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Don't Know
&lt;/h2&gt;

&lt;p&gt;These require real-world usage data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-term pricing stability (both vendors adjust pricing)&lt;/li&gt;
&lt;li&gt;Network egress costs at scale&lt;/li&gt;
&lt;li&gt;Support response quality differences&lt;/li&gt;
&lt;li&gt;Enterprise discount negotiation leverage&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;For more tips and insights, follow me on Twitter &lt;a href="https://x.com/Siddhant_K_code" rel="noopener noreferrer"&gt;@Siddhant_K_code&lt;/a&gt; and stay updated with the latest &amp;amp; detailed tech content like this.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>googlecloud</category>
      <category>aws</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to make AI code edits more accurate</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Fri, 15 Aug 2025 05:07:54 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/how-to-make-ai-code-edits-more-accurate-bbe</link>
      <guid>https://dev.to/siddhantkcode/how-to-make-ai-code-edits-more-accurate-bbe</guid>
      <description>&lt;p&gt;&lt;em&gt;A technical examination of production-grade LSP-MCP integration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After spending months analyzing AI coding tools in production, I've become convinced that most solutions fundamentally misunderstand the structural nature of code. They treat source files as text with light syntactic awareness, missing the rich semantic relationships that make code comprehensible to experienced developers. Serena MCP Server, built by Oraios AI, represents a different approach, one that leverages the mature Language Server Protocol ecosystem to give AI systems the same structural understanding that powers modern IDEs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fundamental problem: Semantic vs Syntactic code understanding
&lt;/h2&gt;

&lt;p&gt;The current generation of AI coding tools relies heavily on Retrieval-Augmented Generation (RAG) with vector embeddings. While effective for broad semantic search ("find authentication-related code"), RAG fails catastrophically at structural code analysis. Consider this scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Implementation A - in payment processing
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShoppingCart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Implementation B - in cart management
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Implementation C - legacy implementation
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RAG will find all three functions when searching for "calculate_total", but cannot determine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which implementation handles tax calculations&lt;/li&gt;
&lt;li&gt;How changing the method signature affects downstream callers
&lt;/li&gt;
&lt;li&gt;Whether a specific call site refers to the instance method or standalone function&lt;/li&gt;
&lt;li&gt;The complete call hierarchy for each implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a failure of RAG; it's a fundamental limitation of semantic similarity search when applied to structured, symbolic systems like programming languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  LSP as the Foundation: Why Language Servers Matter
&lt;/h2&gt;

&lt;p&gt;Language Server Protocol, standardized by Microsoft in 2016, solves exactly this problem through static analysis. LSP implementations parse code into abstract syntax trees, build symbol tables, and maintain cross-references between definitions and usage sites. This enables precise operations like "find all references" that understand scope, inheritance, and overloading.&lt;/p&gt;

&lt;p&gt;The key insight is that LSP provides &lt;strong&gt;structural&lt;/strong&gt; understanding while RAG provides &lt;strong&gt;semantic&lt;/strong&gt; understanding. These are complementary, not competing approaches.&lt;/p&gt;
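Concretely, "find all references" is a single JSON-RPC call in LSP. A hedged sketch of the request body; the file URI and cursor position are illustrative, and real clients also add Content-Length header framing on the wire:

```python
import json

# Illustrative JSON-RPC request for "find all references". The URI and
# position are hypothetical; the method name and shape follow the LSP spec.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/references",
    "params": {
        "textDocument": {"uri": "file:///app/cart.py"},
        "position": {"line": 41, "character": 8},  # cursor on a symbol
        "context": {"includeDeclaration": True},
    },
}

payload = json.dumps(request)
```

The server answers with precise locations resolved through its symbol table, which is exactly the structural signal RAG cannot provide.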

&lt;p&gt;&lt;a href="https://github.com/oraios/serena" rel="noopener noreferrer"&gt;Serena&lt;/a&gt;'s architecture leverages the &lt;code&gt;multilspy&lt;/code&gt; library to interface with language servers across multiple languages. This isn't a reimplementation of language analysis, it's a carefully designed abstraction layer over battle-tested language servers like &lt;code&gt;pylsp&lt;/code&gt; (Python), &lt;code&gt;typescript-language-server&lt;/code&gt;, &lt;code&gt;rust-analyzer&lt;/code&gt;, and &lt;code&gt;gopls&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP integration: protocol design decisions
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol integration reveals several thoughtful architectural choices:&lt;/p&gt;

&lt;h3&gt;
  
  
  Transport Layer: stdio vs SSE
&lt;/h3&gt;

&lt;p&gt;Serena supports both stdio and Server-Sent Events (SSE) transports. The stdio approach follows MCP conventions where the client spawns the server as a subprocess:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx &lt;span class="nt"&gt;--from&lt;/span&gt; git+https://github.com/oraios/serena serena start-mcp-server &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, Serena also supports SSE mode for environments where subprocess management is problematic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;serena start-mcp-server &lt;span class="nt"&gt;--transport&lt;/span&gt; sse &lt;span class="nt"&gt;--port&lt;/span&gt; 9121
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dual-transport design addresses real deployment constraints. In containerized environments or when dealing with permission boundaries, SSE can be more reliable than stdio subprocess communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Process isolation and resource management
&lt;/h3&gt;

&lt;p&gt;The implementation includes a local dashboard (&lt;code&gt;localhost:24282&lt;/code&gt;) that's more than a convenience feature; it's a critical operational component. Since many MCP clients fail to properly clean up subprocesses, the dashboard provides manual shutdown capability and real-time logging.&lt;/p&gt;

&lt;p&gt;The recent migration from FastAPI to Flask (v0.6.0) eliminated asyncio cross-contamination issues between the MCP server and dashboard components. This change removed the need for process isolation and non-graceful shutdowns on Windows, a concrete example of how framework choice affects system reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool architecture: Symbol-level operations
&lt;/h2&gt;

&lt;p&gt;Serena exposes its capabilities through MCP tools that operate at the symbol level rather than text level. Key tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ReplaceSymbolBodyTool&lt;/code&gt;: Replaces function/class implementations while preserving signatures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;InsertAfterSymbolTool&lt;/code&gt;/&lt;code&gt;InsertBeforeSymbolTool&lt;/code&gt;: Positional insertion relative to symbols&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GetCodeMapTool&lt;/code&gt;: Generates hierarchical code structure maps&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SearchForPatternTool&lt;/code&gt;: Pattern-based code search with LSP context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation handles edge cases that naive text manipulation would miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# InsertAfterSymbolTool handles files not ending with newlines
# ReplaceSymbolBodyTool preserves indentation context
# SearchForPatternTool respects gitignore patterns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tools maintain code formatting and structure automatically, reducing the cognitive load on LLMs that would otherwise struggle with precise text manipulation.&lt;/p&gt;
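A toy illustration of the idea, not Serena's implementation: locate a function by name with Python's ast module and swap its body while the signature and surrounding code stay untouched. It assumes a single-line signature and no decorators.

```python
import ast

def replace_function_body(source, name, new_body_lines):
    """Toy symbol-level edit: swap a function's body, keep its signature.

    Assumes a single-line signature and no decorators; a real tool would
    lean on the language server's symbol ranges instead of ast directly.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            lines = source.splitlines()
            indent = " " * node.body[0].col_offset  # preserve indentation
            new_body = [indent + line for line in new_body_lines]
            head = lines[: node.body[0].lineno - 1]   # up to the signature
            tail = lines[node.body[-1].end_lineno :]  # after the old body
            return "\n".join(head + new_body + tail)
    raise ValueError(f"symbol {name!r} not found")

src = "def total(items):\n    return 0\n\ndef other():\n    pass\n"
patched = replace_function_body(src, "total", ["return sum(items)"])
print(patched)
```

Because the edit is anchored to a symbol rather than a text offset, everything outside the target body survives unchanged.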

&lt;h2&gt;
  
  
  Memory system and project indexing
&lt;/h2&gt;

&lt;p&gt;Serena implements a persistent memory system in &lt;code&gt;.serena/memories/&lt;/code&gt; directories. This isn't just caching; it's a designed knowledge accumulation system. During initial project onboarding, Serena:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Indexes the entire codebase using language servers&lt;/li&gt;
&lt;li&gt;Builds cross-reference databases
&lt;/li&gt;
&lt;li&gt;Identifies key architectural patterns&lt;/li&gt;
&lt;li&gt;Stores project-specific context for future sessions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The indexing process is asynchronous and happens in a background thread queue, ensuring immediate MCP server responsiveness while building comprehensive project understanding.&lt;/p&gt;
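That pattern can be sketched with a plain queue and worker thread (a simplification of whatever Serena does internally; the file names are illustrative):

```python
import queue
import threading

index_queue = queue.Queue()
indexed = []  # stands in for the cross-reference database

def worker():
    while True:
        path = index_queue.get()
        if path is None:          # sentinel: shut the worker down
            break
        indexed.append(path)      # real code would run LSP analysis here
        index_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# The server stays responsive: enqueue and return immediately.
for f in ["cart.py", "payments.py"]:
    index_queue.put(f)
index_queue.join()                # here, only to observe completion
index_queue.put(None)
```

The enqueue call returns in microseconds, so MCP tool calls never block on indexing work.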

&lt;h2&gt;
  
  
  Language support: Direct vs Indirect
&lt;/h2&gt;

&lt;p&gt;Serena's language support demonstrates pragmatic engineering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct support&lt;/strong&gt; (fully tested):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python (&lt;code&gt;pylsp&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;TypeScript/JavaScript (&lt;code&gt;typescript-language-server&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Java (note: slow startup, especially on macOS)&lt;/li&gt;
&lt;li&gt;Rust (&lt;code&gt;rust-analyzer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Go (&lt;code&gt;gopls&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;C/C++ (&lt;code&gt;clangd&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;PHP (&lt;code&gt;php-language-server&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Indirect support&lt;/strong&gt; (untested but theoretically functional):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ruby, C#, and other languages supported by &lt;code&gt;multilspy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tiered approach acknowledges the reality of language server ecosystem maturity while providing extension points for additional languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Security model
&lt;/h3&gt;

&lt;p&gt;The Docker deployment provides security isolation for shell command execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /path/to/your/projects:/workspaces/projects &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/oraios/serena:latest serena start-mcp-server &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Volume mounting limits filesystem access scope while network host mode ensures language server communication works correctly. The container approach also eliminates local language server installation requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance characteristics
&lt;/h3&gt;

&lt;p&gt;Several design decisions optimize for large codebase performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lazy language server initialization&lt;/strong&gt;: Servers start only when needed for specific languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental indexing&lt;/strong&gt;: Only modified files trigger re-indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol table caching&lt;/strong&gt;: LSP responses are cached to avoid repeated analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background task queue&lt;/strong&gt;: Tool executions are serialized to prevent resource contention&lt;/li&gt;
&lt;/ul&gt;
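The caching idea, reduced to a sketch with functools.lru_cache; find_references here is a hypothetical stand-in for an LSP round-trip, and a real cache would also invalidate entries when files change:

```python
from functools import lru_cache

calls = 0  # counts how many times the "expensive" analysis actually runs

@lru_cache(maxsize=1024)
def find_references(file, symbol):
    # Hypothetical stand-in for an expensive language-server round-trip.
    global calls
    calls += 1
    return (f"{file}:{symbol}",)

find_references("cart.py", "calculate_total")
find_references("cart.py", "calculate_total")  # second call hits the cache
```

Only the first call pays the analysis cost; repeated queries for the same symbol are served from memory.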

&lt;h3&gt;
  
  
  Integration patterns
&lt;/h3&gt;

&lt;p&gt;The MCP architecture enables multiple integration patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;IDE integration&lt;/strong&gt;: Direct MCP support in VSCode, Cursor, IntelliJ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat clients&lt;/strong&gt;: Claude Desktop, Claude Code with free tier support
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom frameworks&lt;/strong&gt;: Tool abstraction allows integration with LangGraph, pydantic-ai, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web clients&lt;/strong&gt;: &lt;code&gt;mcpo&lt;/code&gt; bridge for ChatGPT and other non-MCP clients&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Critical limitations and trade-offs
&lt;/h2&gt;

&lt;p&gt;Serena inherits the fundamental limitations of static analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic behavior&lt;/strong&gt;: Runtime code generation, reflection, and metaprogramming remain invisible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language boundaries&lt;/strong&gt;: FFI calls and inter-process communication aren't tracked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration-driven behavior&lt;/strong&gt;: Dependency injection and configuration-based routing can't be analyzed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test coverage gaps&lt;/strong&gt;: Dynamic test discovery may miss runtime test generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system also makes deliberate trade-offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Completeness over speed&lt;/strong&gt;: Full project indexing provides accuracy but requires upfront time investment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision over recall&lt;/strong&gt;: LSP-based analysis misses some relationships but ensures high confidence in reported relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local over cloud&lt;/strong&gt;: On-device analysis ensures privacy but limits available computational resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this architecture matters
&lt;/h2&gt;

&lt;p&gt;Serena represents a maturation point in AI coding tools. Rather than building yet another vector database, it leverages decades of language tooling investment. The LSP ecosystem already solved structural code analysis; Serena makes this capability available to LLMs through a clean protocol boundary.&lt;/p&gt;

&lt;p&gt;The MCP integration is equally thoughtful. By implementing both stdio and SSE transports, supporting multiple client types, and providing operational tooling, Serena addresses real deployment constraints rather than just the happy path.&lt;/p&gt;

&lt;p&gt;Most importantly, Serena's tool design acknowledges that LLMs and static analysis have complementary strengths. LLMs excel at semantic understanding and natural language intent parsing. Static analysis excels at precise structural relationships and impact analysis. The architecture exploits both strengths without trying to force one approach to handle everything.&lt;/p&gt;

&lt;p&gt;This is what production-grade AI tooling looks like: principled architecture, thoughtful integration, and clear boundaries between components with different strengths.&lt;/p&gt;




&lt;p&gt;For more tips and insights, follow me on Twitter &lt;a href="https://x.com/Siddhant_K_code" rel="noopener noreferrer"&gt;@Siddhant_K_code&lt;/a&gt; and stay updated with the latest &amp;amp; detailed tech content like this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>An easy way to stop Claude code from forgetting the rules</title>
      <dc:creator>Siddhant Khare</dc:creator>
      <pubDate>Wed, 02 Jul 2025 20:45:23 +0000</pubDate>
      <link>https://dev.to/siddhantkcode/an-easy-way-to-stop-claude-code-from-forgetting-the-rules-h36</link>
      <guid>https://dev.to/siddhantkcode/an-easy-way-to-stop-claude-code-from-forgetting-the-rules-h36</guid>
      <description>&lt;p&gt;You spend time setting up Claude Code with specific instructions in your CLAUDE.md file. Maybe you want it to always ask for confirmation before creating files, or to follow particular coding workflows. It works perfectly for the first few exchanges.&lt;/p&gt;

&lt;p&gt;Then something changes. By the fourth or fifth interaction, Claude Code starts ignoring your rules. It stops asking for confirmation. It forgets your workflow preferences. It's like your CLAUDE.md instructions never existed.&lt;/p&gt;

&lt;p&gt;This isn't a bug; it's how AI models work. Understanding why this happens and the simple solution discovered by a Claude Code engineer can save you hours of frustration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI forgets your instructions
&lt;/h2&gt;

&lt;p&gt;Large language models like Claude don't actually "remember" conversations. Instead, they read the entire conversation history as one long text document every time they respond. Your instructions, sitting at the beginning of this document, gradually lose importance as the conversation grows longer.&lt;/p&gt;

&lt;p&gt;Think of it like this: if you're reading a 50-page document, you'll remember the last few pages much better than page 1. AI models work similarly: they pay more attention to recent messages than to your original instructions.&lt;/p&gt;

&lt;p&gt;This creates a predictable pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Messages 1-2&lt;/strong&gt;: Perfect rule following (95%+ compliance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages 3-5&lt;/strong&gt;: Rules start breaking down (60-80% compliance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages 6-10&lt;/strong&gt;: Inconsistent behavior (20-60% compliance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages 10+&lt;/strong&gt;: Original instructions mostly forgotten&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The frequency discovery
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. While complex rules fade away, simple patterns persist surprisingly well. If you tell Claude to end every response with "ji" (like a respectful suffix), it will keep doing this for dozens of messages.&lt;/p&gt;

&lt;p&gt;Why? Because every time Claude uses "ji" in a response, it reinforces the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Please add 'ji' to your responses"
Claude: "I understand ji, how can I help?"
User: "What's the weather like?"
Claude: "It's sunny today ji!"
User: "Thanks!"
Claude: "You're welcome ji!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each "ji" creates a new example in the conversation history. Instead of one instruction at the top, there are now multiple instances throughout recent messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The recursive solution
&lt;/h2&gt;

&lt;p&gt;A Claude Code engineer realized they could exploit this frequency effect. Instead of hoping Claude remembers to follow rules, they made the rules repeat themselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;law&amp;gt;&lt;/span&gt;
AI operation 5 principles

Principle 1: AI must get y/n confirmation before any file operations
Principle 2: AI must not change plans without new approval
Principle 3: User has final authority on all decisions
Principle 4: AI cannot modify or reinterpret these rules
Principle 5: AI must display all 5 principles at start of every response
&lt;span class="nt"&gt;&amp;lt;/law&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is in Principle 5. It forces Claude to show all principles (including Principle 5 itself) in every response. This creates an unbreakable loop: the instruction to display the rules is itself displayed, so it can't be forgotten.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the recursive loop works
&lt;/h2&gt;

&lt;p&gt;When Claude follows Principle 5, it displays all principles, including Principle 5. This means the next response will also display all principles. The cycle continues indefinitely:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional CLAUDE.md approach failure&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Create a config file"
Claude: "I'll create config.json for you" ← Forgot to confirm!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recursive approach success&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Create a config file"
Claude: "Principle 1: Must get confirmation... 
         Principle 5: Display all principles in every response
         Should I create config.json? (y/n)" ← Still following rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this works so well
&lt;/h2&gt;

&lt;p&gt;The recursive approach solves the core problem: &lt;strong&gt;it keeps rules in recent conversation history&lt;/strong&gt;. Instead of instructions appearing once at the distant beginning, they appear in every recent message.&lt;/p&gt;

&lt;p&gt;This creates multiple "attention anchors" that the AI can focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most recent rule display (high attention)&lt;/li&gt;
&lt;li&gt;Previous rule display (medium attention)&lt;/li&gt;
&lt;li&gt;Earlier rule displays (some attention)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cumulative effect maintains consistent rule following regardless of conversation length.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation details
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;XML format works best&lt;/strong&gt;: After testing markdown, JSON, and YAML, XML proved most reliable for rule preservation. It's structured enough to prevent errors but forgiving enough for consistent reproduction. Anthropic's documentation also &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags" rel="noopener noreferrer"&gt;recommends XML tags for structured prompts&lt;/a&gt; because Claude handles them particularly well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule ordering matters&lt;/strong&gt;: Place the self-referential rule last (Principle 5). This ensures it gets displayed even if earlier rules are truncated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verbatim instruction&lt;/strong&gt;: Specify "verbatim" or "exactly" to prevent paraphrasing that might break the recursive pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token cost&lt;/strong&gt;: Each response includes 50-100 extra tokens for rule display. But it reduces the need for correction messages, which often makes it more efficient overall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Conditional display&lt;/strong&gt;: You can make rules context-sensitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;rule&amp;gt;&lt;/span&gt;
  If request involves file operations: Display all safety rules
  Otherwise: Display condensed rules only
&lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hierarchical rules&lt;/strong&gt;: Different rule sets for different situations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;meta_rules&amp;gt;&lt;/span&gt;Always display meta_rules and current_context_rules&lt;span class="nt"&gt;&amp;lt;/meta_rules&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;safety_rules&amp;gt;&lt;/span&gt;Rules for file operations, API calls, etc.&lt;span class="nt"&gt;&amp;lt;/safety_rules&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;style_rules&amp;gt;&lt;/span&gt;Rules for formatting, tone, etc.&lt;span class="nt"&gt;&amp;lt;/style_rules&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting started with Claude Code
&lt;/h2&gt;

&lt;p&gt;Here's a minimal CLAUDE.md template to try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;behavioral_rules&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;rule_1&amp;gt;&lt;/span&gt;Always confirm before creating or modifying files&lt;span class="nt"&gt;&amp;lt;/rule_1&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;rule_2&amp;gt;&lt;/span&gt;Report your plan before executing any commands&lt;span class="nt"&gt;&amp;lt;/rule_2&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;rule_3&amp;gt;&lt;/span&gt;Display all behavioral_rules at start of every response&lt;span class="nt"&gt;&amp;lt;/rule_3&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/behavioral_rules&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test it by running a coding session of 10+ exchanges and checking whether the rules persist. Then adjust the rules to match the behaviors you most need to maintain in your development workflow.&lt;/p&gt;
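&lt;p&gt;You can automate that check instead of eyeballing the transcript. This is a minimal sketch, assuming each assistant response should open with the verbatim rules block; the function and variable names are hypothetical, and a real session transcript would come from your own logs.&lt;/p&gt;

```python
# Flag responses in a session transcript that dropped or altered the
# verbatim rules block. Names are hypothetical; the rules mirror the
# minimal CLAUDE.md template above.

RULES_BLOCK = """&lt;behavioral_rules&gt;
  &lt;rule_1&gt;Always confirm before creating or modifying files&lt;/rule_1&gt;
  &lt;rule_2&gt;Report your plan before executing any commands&lt;/rule_2&gt;
  &lt;rule_3&gt;Display all behavioral_rules at start of every response&lt;/rule_3&gt;
&lt;/behavioral_rules&gt;"""

def rules_persist(responses, rules=RULES_BLOCK):
    """Return indices of responses that did not start with the rules."""
    return [i for i, r in enumerate(responses)
            if not r.lstrip().startswith(rules)]

session = [
    RULES_BLOCK + "\n\nPlan: add a test file. Confirm?",
    "Sure, creating test_app.py now.",  # rules were dropped here
]
print(rules_persist(session))  # [1]
```

A non-empty result tells you exactly which exchange the decay started at, which is useful when tuning how strictly you word the self-referential rule.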

&lt;h2&gt;
  
  
  When to use this approach
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File operations requiring confirmation&lt;/li&gt;
&lt;li&gt;Code generation workflows&lt;/li&gt;
&lt;li&gt;Multi-step development tasks&lt;/li&gt;
&lt;li&gt;Long Claude Code sessions where rule adherence matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not necessary for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple questions with short responses&lt;/li&gt;
&lt;li&gt;One-off code snippets&lt;/li&gt;
&lt;li&gt;Exploratory conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;This recursive technique reveals something important about working with AI: &lt;strong&gt;frequency beats complexity&lt;/strong&gt;. Simple rules repeated on every exchange outperform elaborate instructions written once.&lt;/p&gt;

&lt;p&gt;As AI systems become more capable and handle more important tasks, techniques like this become essential. They transform unreliable assistants into dependable tools that maintain consistent behavior.&lt;/p&gt;

&lt;p&gt;The recursive approach isn't just a clever hack; it's a foundation for building trustworthy AI workflows. When your AI assistant needs to follow specific procedures, this technique ensures it actually does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Works everywhere, not just Claude
&lt;/h2&gt;

&lt;p&gt;This isn't just a Claude Code fix. It works for any LLM that responds to prompt structure: GPT, Gemini, Mistral, whatever. The principle is universal across all transformer-based language models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The fundamental rule: If it's not in the output, it won't stay in context. If it's not in context, it gets forgotten.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This applies whether you're using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT for coding assistance&lt;/li&gt;
&lt;li&gt;Gemini for research tasks&lt;/li&gt;
&lt;li&gt;Mistral for content generation&lt;/li&gt;
&lt;li&gt;Local models like Llama or Qwen&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recursive pattern exploits how all these models handle attention and context. They all suffer from the same instruction decay problem, and they all respond to the same frequency-based solution.&lt;/p&gt;

&lt;p&gt;The specific XML format might need slight adjustments for different models, but the core principle, making rules display themselves, works universally. It's not about Claude's architecture; it's about the fundamental nature of how language models process sequential text.&lt;/p&gt;




&lt;p&gt;For more tips and insights, follow me on Twitter &lt;a href="https://x.com/Siddhant_K_code" rel="noopener noreferrer"&gt;@Siddhant_K_code&lt;/a&gt; and stay updated with the latest &amp;amp; detailed tech content like this.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
