DEV Community: Siddharth Bhalsod

When RAG Is the Wrong Answer for Your LLM System

Siddharth Bhalsod — Mon, 20 Jul 2026 09:31:48 +0000

Ask ten engineering teams how they're grounding their LLM in company knowledge, and nine will answer retrieval before you finish the question. Not because they ran the comparison. Because retrieval is what the tutorials default to, what the frameworks assume, and what nobody has to defend in a design review.

That's not an architecture decision. That's a reflex.

Fine-tuning and long context aren't fallback options for when RAG gets weird in production. They answer a different question: where does the information your model needs actually live? Get that question wrong, and no amount of chunking strategy or reranker tuning fixes it later.

The Question Everyone Skips

Every running program keeps the data it needs in one of three places. Baked into the binary at compile time. Loaded into memory before the program starts executing. Or fetched from disk the moment something asks for it.

LLM architecture makes the same three choices, just under different names. Fine-tuning bakes information into the model's weights. Long context loads it into working memory for a single request. Retrieval fetches it from an external index on demand. Seen this way, "RAG or fine-tuning" stops looking like a fair fight between two competitors. It's a decision tree missing its first branch, the one nobody asks out loud: does this even belong in the binary at all?

Fine-Tuning Is a Compile Step, Not a Database

Fine-tuning changes what a model does, not what it knows in any durable sense. That distinction gets lost constantly. Teams fine-tune a model on last quarter's product catalog expecting it to "know" the catalog, then wonder why it hallucinates SKUs from two versions ago.

FineTuneBench, a benchmark built specifically to test this, found fine-tuning succeeded at absorbing new factual updates only about 19% of the time. Not because the training ran badly. Because weights aren't a lookup table. They're a compressed, lossy compile step over whatever examples you fed in, and compression is a bad medium for facts that need to be exact and current.

What fine-tuning is genuinely good at is behavior: tone, output format, refusal patterns, domain-specific reasoning style. LoRA and QLoRA lowered the entry cost enough that the old "you need thousands of examples" rule doesn't hold anymore. For classification or format-locking tasks, 200 to 500 curated examples is often sufficient. Together AI offers this as managed infrastructure, tuning and inference in one stack, which suits teams who don't want to own GPU orchestration. Axolotl, the config-driven open-source framework, suits teams who want every knob exposed and version-controlled. Neither platform has a workflow for "add one new fact." That absence is the tell. Fine-tuning assumes you're compiling behavior, not writing to a database.

The strongest commercial case for fine-tuning right now isn't knowledge injection at all. It's distillation: taking a frontier model's behavior on a narrow task and compressing it into a smaller, cheaper open-weight model. That's a real, defensible use of the technique. Using it to keep a knowledge base current is not.

Long Context Is Preloading, Not a Free Lookup

The pitch for long context sounds like it eliminates the whole problem. Gemini 3.1 Pro and Gemini 3 Flash ship a million-token window. Claude's frontier models sit at the same mark. Just paste the whole knowledge base in and skip the architecture entirely.

Here's what that pitch leaves out: a million-token window doesn't mean a million tokens of equally reliable recall. Independent RULER-style evaluations in 2026 still show recall sagging well before models reach their stated ceiling. The same lost-in-the-middle effect that shows up at 50,000 tokens shows up again, just later, at whatever fraction of the window your corpus happens to occupy. And a study comparing retrieval against long context directly (LaRA, arXiv 2502.09977) found RAG still beating long-context approaches by 6 to 38% for smaller models, even at a relatively modest 128,000-token window. The advantage long context has is real, but it's concentrated in frontier-scale models and the teams with the budget to run them, not a universal upgrade.

This isn't a contradiction of the context rot argument from the token budgeting piece earlier in this series. It's the same mechanism, wearing a different hat. A window you fill because your corpus genuinely fits comfortably under where recall degrades is a legitimate design choice. A window you fill because the vendor advertises a million tokens and it's easier than building an index is the same context rot, just with a bigger invoice attached, since every one of those tokens gets reprocessed on every single call.

Retrieval's Real Advantage Isn't Quality. It's Decoupling.

Retrieval doesn't win because it's smarter than the other two. It wins because it's the only one of the three where the size of your knowledge base is decoupled from the cost of using the model. Add a thousand new documents to a RAG index and nothing retrains, nothing reprocesses. Add them to a fine-tuned model or a long-context prompt and you're paying, in training compute or in reprocessed tokens, for every addition.
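The decoupling is easiest to see as arithmetic. What follows is a minimal sketch, not a pricing model: the function names, the five-chunk retrieval depth, and the 500-token chunk size are all illustrative assumptions, and it deliberately ignores prompt caching and the cost of building the index.

```python
def tokens_per_call_long_context(corpus_tokens: int, question_tokens: int) -> int:
    """Long context: the whole corpus is sent again with every request."""
    return corpus_tokens + question_tokens


def tokens_per_call_retrieval(top_k: int, chunk_tokens: int, question_tokens: int) -> int:
    """Retrieval: only the top-k chunks reach the model, whatever the corpus size."""
    return top_k * chunk_tokens + question_tokens


# Grow the corpus tenfold and compare what each approach sends per request.
for corpus in (50_000, 500_000):
    lc = tokens_per_call_long_context(corpus, question_tokens=200)
    rag = tokens_per_call_retrieval(top_k=5, chunk_tokens=500, question_tokens=200)
    print(f"corpus={corpus:>7}  long_context={lc:>7}  retrieval={rag}")
```

Under these assumptions the long-context column grows with the corpus while the retrieval column does not move, which is the whole argument in two lines of output.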

That's why retrieval is the correct default for information that changes faster than monthly, which is the guidance most vendor documentation converges on: retrieval for grounding in facts that move, fine-tuning for behavior that should hold steady. It has nothing to do with retrieval producing better answers in the abstract. A well-tuned model with no retrieval will out-answer a badly-chunked RAG pipeline every time.

The flip side rarely gets said out loud: if your knowledge base is small and mostly static, the "fetch" step retrieval requires is overhead you're paying with no corresponding benefit. An index, a retriever, and a reranker are three new systems to operate and three new failure modes to debug, for a corpus that would have fit inside a single prompt.

Why Teams Default to Retrieval Anyway

The technical case above is not why most teams end up with RAG. The organizational case is simpler and less comfortable to say out loud. Retrieval lets a team ship without anyone on the call defending a tradeoff.

Fine-tuning requires someone to own a training pipeline, which means admitting the team needs ML infrastructure it may not have budgeted for. Long context requires someone to accept, explicitly, that recall degrades as the prompt grows, which means putting a number on an accuracy tradeoff in a meeting where nobody wants to be the one who signed off on it. Retrieval requires neither confession. The tradeoffs are still there, buried in chunking decisions and reranker configs, but they don't have to be said in the room where the architecture gets approved.

That's an incentive structure, not a technical argument, and it's exactly why the choice gets made once, in a sprint, by whoever set up the vector database first, and then never gets revisited even after the corpus and the query patterns have changed underneath it.

The three approaches aren't a ranked list with retrieval on top and the others as consolation prizes. They're three different places to keep the same information, and the only questions worth asking are how often the underlying facts change, how large a model you can afford to run, and whether you can absorb reprocessing cost on every call. A default chosen once, in a sprint, and never revisited is still an architecture decision. It's just an accidental one, made by whoever happened to be in the room.
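One way to make that decision explicit is to write it down as a function someone has to argue with. This is a rough sketch of one reading of the argument above, not a rule. The `KnowledgeProfile` fields and the `choose_home` name are hypothetical, and `comfortable_recall_tokens` is a number you would have to measure for your own model rather than read off a spec sheet.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeProfile:
    is_behavior_not_facts: bool        # tone, format, refusal patterns, reasoning style
    changes_faster_than_monthly: bool  # how quickly the underlying facts move
    corpus_tokens: int                 # size of the knowledge you need to ground on
    comfortable_recall_tokens: int     # assumed: measured point below which recall holds up


def choose_home(p: KnowledgeProfile) -> str:
    """Pick where the information should live: weights, prompt, or index."""
    if p.is_behavior_not_facts:
        return "fine-tune"       # compile behavior into the weights
    if p.changes_faster_than_monthly:
        return "retrieve"        # facts that move stay outside the model
    if p.corpus_tokens <= p.comfortable_recall_tokens:
        return "long-context"    # small and static: preload it, skip the index
    return "retrieve"            # too large to reprocess on every call


print(choose_home(KnowledgeProfile(False, False, 20_000, 50_000)))
```

The branch order is the design choice worth debating: behavior is checked first because no amount of retrieval changes how a model writes, and corpus size is checked last because it only matters once the facts are known to be stable.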

Retrieval as a First-Class Context Operation, Not a RAG Afterthought

Siddharth Bhalsod — Mon, 13 Jul 2026 16:02:47 +0000

A team ships retrieval-augmented generation in a single sprint. A vector database, an embedding call, a similarity search wired into the existing chat endpoint. The demo works. Support tickets get answered with citations attached, and everyone moves on to the next feature.

Six weeks later, the wrong document keeps winning. Not because the model got worse. Because nobody decided, on purpose, how the corpus should be split, searched, or ranked before the first real user typed a question. Those decisions got made anyway, by whatever the tutorial's defaults happened to be, and now they're load-bearing.

That's what "RAG as an afterthought" actually looks like in production. Not a missing feature. A set of irreversible decisions nobody remembers making.

The Feature That Isn't a Feature

Most teams reach for retrieval the way they'd reach for a caching layer: bolt it in front of the model, ship it, tune it later if something looks off. Early retrieval-augmented deployments really were built this way, added mainly to cut down on hallucinated answers rather than as a core design choice. That framing has aged badly. As retrieval-heavy systems move into workflows where a wrong answer costs money or triggers an audit, the technique stops behaving like a feature you can swap and starts behaving like a decision you already made, whether you meant to or not.

Article 3 argued you should start infrastructure with the data layer, not the model. This is a narrower, sharper version of that same instinct. Inside the data layer itself, retrieval isn't a call you make. It's a schema you commit to. Chunking, the process of splitting source documents into retrievable units, is the clearest example of what that means in practice. Decide it once, at design time, and every future query lives inside that decision, whether the team realizes it or not.

The Decisions That Are Actually a Schema

Retrieval quality problems get blamed on the model. Most of the time the real cause was set months earlier, in how the corpus got chunked and indexed. Fixed-size splitting, cutting every document into equal token counts, is the fastest way to get a RAG demo working and the slowest way to get it right for anything beyond a demo. Structure-aware splitting, breaking on headings or function boundaries, and hierarchical chunking, indexing a small child chunk for precise matching alongside a larger parent chunk for context, both consistently outperform naive fixed-size splitting in production benchmarks. The exact numbers vary enough across studies that no single ranking should be treated as settled, but the direction of the finding holds across most of them.

What the studies do agree on is the tension hierarchical chunking is built to solve. Small chunks retrieve precisely but arrive at the model missing surrounding context. Large chunks preserve context but dilute the similarity signal that made them findable in the first place. Frameworks like LlamaIndex ship hierarchical retrieval as a built-in pattern specifically because enough teams kept rebuilding a version of it by hand that it stopped making sense as custom code.
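The parent and child relationship is simple enough to sketch without any framework. The version below is illustrative only and is not LlamaIndex's implementation: it splits on word counts instead of tokens or headings, the `build_hierarchy` and `retrieve_parent` names are made up for this example, and plain word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the larger chunk this piece was cut from


def build_hierarchy(doc: str, parent_words: int = 200, child_words: int = 50):
    """Cut a document into large parent chunks, then cut each parent into small children."""
    words = doc.split()
    parents: list[str] = []
    children: list[Child] = []
    for start in range(0, len(words), parent_words):
        parent = words[start:start + parent_words]
        parent_id = len(parents)
        parents.append(" ".join(parent))
        for c in range(0, len(parent), child_words):
            children.append(Child(" ".join(parent[c:c + child_words]), parent_id))
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Match against the small chunks, but hand the model the large one."""
    q = set(query.lower().split())
    best = max(children, key=lambda ch: len(q & set(ch.text.lower().split())))
    return parents[best.parent_id]
```

The match happens on fifty words, so the similarity signal stays sharp, and the model receives two hundred, so the surrounding context arrives with it. Changing either number after launch means rebuilding both lists, which is the schema point in miniature.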

None of this is a setting anyone flips casually. Changing chunk strategy after launch means re-embedding and reindexing the entire corpus, not editing a config file. That's the schema comparison earning its keep. A database team wouldn't redesign a primary key structure as a Tuesday afternoon task, and a retrieval system's chunking strategy deserves the same caution, because it constrains every query the system will ever be asked to answer.

The Coupling Nobody Names

Chunk size, embedding model, and reranker don't operate independently, even though most architecture diagrams draw them as three separate boxes. Change the embedding model and the ideal chunk size shifts underneath it. Add a reranker and the number of candidates worth pulling in the first retrieval pass changes too. Production teams that treat these as three independent settings tend to discover the coupling the hard way. A routine embedding model upgrade quietly degrades answer quality, and nobody connects it back to a reranker still tuned against the old embeddings.

The fix isn't clever configuration. It's hybrid retrieval, combining dense vector search with a sparse keyword method like BM25 and merging the two ranked lists, paired with a dedicated reranker such as Cohere Rerank or an open cross-encoder like bge-reranker. Hybrid retrieval earns its cost because pure semantic search is weakest exactly where structured data is strongest: part numbers, error codes, proper nouns, the kind of tokens where an exact match beats a fuzzy one. Vector store choice mostly follows from scale rather than preference. pgvector inside an existing Postgres instance covers most teams under five to ten million vectors, and dedicated stores like Qdrant or Weaviate earn their added complexity once filtering and scale genuinely demand it.
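The text above doesn't say how the two ranked lists get merged. One common choice is reciprocal rank fusion, sketched below as an assumption about the merge step and not a description of any particular product. The document ids are placeholders, and `k=60` is the constant commonly used with this method.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for every list a document appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # order from the vector search
sparse = ["doc_c", "doc_a", "doc_d"]  # order from BM25
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Fusion works on ranks instead of raw scores, so the dense and sparse scores never have to be put on the same scale. Documents that both methods surface rise to the top, and the fused list is what a reranker would then reorder.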

This coupling is also why retrieval needs its own evaluation discipline, not one borrowed wholesale from general model evals. Article 4 covered eval systems as a sequencing problem across an entire pipeline. Retrieval needs a narrower version of that same layer: a check that scores whether the retrieved chunks were relevant at all, before anyone measures whether the final generated answer sounded right.
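A retrieval-only check can be as small as two functions run against a hand-labeled set of questions. The sketch below assumes you already have, for each test question, the ids the retriever returned and the ids a human marked as relevant; the ids shown are placeholders.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that a human marked relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that made it into the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3", "c5"}
print(precision_at_k(retrieved, relevant, k=4))  # → 0.5
```

Neither number says anything about the generated answer, which is the point: they isolate the retrieval stage so an embedding or chunking change can be judged before the model's fluency hides it.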

The Fork That Decides the Whole System's Shape

Underneath the tuning details sits a bigger, earlier choice. Naive single-pass retrieval, hybrid retrieval with reranking, or agentic retrieval, where the model itself decides whether to search again, aren't three tiers of the same architecture. They're three different systems, with different cost curves and different failure modes, and picking one after the others are already built means replacing the orchestration layer, not upgrading it.

The cost gap alone should settle most of the decision before a single line of code gets written. Naive retrieval runs at a fraction of a cent per query. Hybrid retrieval with reranking adds a modest per-query cost for a real jump in precision. Agentic retrieval, where the model plans, retrieves, evaluates, and sometimes retrieves again, can run ten to a hundred times the cost of the naive version, because a single user question might trigger several retrieval rounds instead of one.

One documented production overhaul shows the gap between architecture and afterthought directly. A support-knowledge-base deployment moved from naive single-pass retrieval to hybrid search with reranking and query rewriting, and its retrieval precision score moved from roughly half to over four-fifths in the process. That's one case, not a universal constant, but the direction is consistent with the broader research: architecture choice tends to dominate the outcome more than model choice does.

Choosing agentic retrieval also means choosing an orchestration story, not just a smarter loop. The common production pairing in 2026 is a retrieval-focused framework like LlamaIndex handling chunking, hybrid search, and reranking, wired into an orchestration layer like LangGraph managing the decision loop on top of it. Bolt that on after a naive pipeline already has real users, and the migration looks less like adding a feature and more like replacing the foundation while the building stays open.

What Retrieval Architecture Actually Costs Later

None of this stays technical for long. A founder who approves "agentic RAG" without knowing it means a per-query cost that scales with usage, not a flat line item, is going to be surprised by an invoice. A team that can't point to which chunk an answer actually came from is going to struggle the first time a regulator or an enterprise customer asks for an audit trail. A retrieval system nobody owns, tuned once during a hackathon and never revisited, degrades the same way an unmonitored database does. Quietly, until someone notices the answers have been subtly wrong for a while.

The teams that get this right treat retrieval the way they treat their production database: someone owns it, changes to it go through review, and its behavior gets measured on a schedule instead of when a customer complains. The teams that get it wrong treat it as a feature ticket, assigned to whoever happened to be free that sprint.

Nobody schedules a sprint to add a database after the product already has users depending on one shape of data. Retrieval deserves the same instinct. The chunking decision, the embedding model, the index structure: these are schema, not settings. Get them wrong on day one, and the fix six weeks later isn't a config change. It's a migration, with the downtime and risk that word implies for anyone who has run one at 2 a.m.

The team that ships a single similarity search call this sprint isn't ahead of the team that spent two weeks designing a retrieval architecture first. They're just borrowing time from a migration they haven't scheduled yet.

Context Compaction Patterns: When to Summarize, Truncate, or Retrieve

Siddharth Bhalsod — Wed, 08 Jul 2026 08:27:14 +0000

A context window fills up mid-session. Something has to leave. Most systems answer that problem with one move: summarize everything and hope the important parts survive. It's the same instinct that made "just add more context" the reflexive answer to the budgeting problem — a single technique, applied everywhere, regardless of what's actually being thrown out.

Compaction isn't one operation. It's three, and they destroy information in completely different ways. Summarizing keeps the gist and loses the precision. Truncating keeps the recent and drops the old outright. Retrieving loses nothing permanently; it just stops sitting in front of the model until something asks for it again. Pick the wrong one for a given piece of context, and the failure doesn't announce itself as a compaction bug. It shows up as the model forgetting a decision it already made, misreading a file it already read, or confidently inventing an API it saw an hour ago.

The Mistake Hiding Inside "Just Summarize It"

Ask an engineer how their system handles a full context window and the answer is usually one sentence: it summarizes. That's the imposter definition of compaction, and it's wrong in a specific way. Summarization is one tool in a set of three, best suited to exactly one kind of content: information where the gist matters more than the wording, and where there's no cheaper way to get it back later.

Truncation is not a worse version of summarization. It's a different bet entirely, that the old content isn't coming back and doesn't need a lossy stand-in either, because it's either genuinely stale or cheaply re-obtainable from somewhere else. Retrieval is not a fallback for when summarization would lose too much. It's the option that applies whenever the real thing still exists somewhere addressable, and fetching it again costs less than carrying a compressed copy around indefinitely.

Treat all three as interchangeable synonyms for "make it smaller," and a system will compact the wrong things the wrong way. A four-line grep result and a forty-turn architectural discussion do not fail the same way when compressed. Building one pipeline for both is the actual root cause of context management that quietly breaks agent behavior weeks into production, long after anyone remembers the shortcut was taken.

The One Question That Decides Which Technique to Use

There's a single question that sorts almost every piece of discardable context into the right bucket: if this disappeared right now, could you get it back, cheaply, from somewhere else?

If yes, retrieve. The content lives in a file, a database, a vector index, a URL. The model doesn't need to carry it, it needs to know it exists and how to ask for it again. Compressing it into a summary is strictly worse than dropping it and re-fetching the original when it's actually needed, because a summary of a file is a lossy copy of something you already have a lossless copy of.

If no, but the value sits in the narrative rather than the exact wording, summarize. A ninety-minute debugging conversation isn't reconstructable from any external source. Nobody logged the reasoning trail. But the decisions made along the way, the constraints established, the dead ends already ruled out, survive compression into a few paragraphs. The exact phrasing of turn forty doesn't matter. The fact that turn forty ruled out a database migration does.

If no, and the content has already been superseded by something more recent, truncate. An old grep result against a file that's since been edited isn't worth summarizing. It isn't worth a placeholder either. It's just wrong now, and the right move is to let it go rather than spend tokens compressing something that's no longer true.
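The sorting question translates almost directly into a small function. Treat this as a sketch under stated assumptions: the `ContextItem` fields are hypothetical labels something upstream would have to assign, and the `KEEP` fallthrough is an addition for content that fits none of the three buckets, such as text that is neither re-fetchable nor safe to paraphrase.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRIEVE = "retrieve"
    SUMMARIZE = "summarize"
    TRUNCATE = "truncate"
    KEEP = "keep"  # assumption: not one of the three techniques in the text


@dataclass
class ContextItem:
    refetchable: bool     # lives in a file, database, index, or URL and is cheap to get back
    superseded: bool      # something more recent has already replaced it
    gist_is_enough: bool  # the narrative matters, the exact wording does not


def choose_compaction(item: ContextItem) -> Action:
    """Sort one piece of discardable context into a compaction technique."""
    if item.refetchable:
        return Action.RETRIEVE   # drop it and keep a pointer to the lossless copy
    if item.superseded:
        return Action.TRUNCATE   # no longer true, so not worth compressing
    if item.gist_is_enough:
        return Action.SUMMARIZE  # irreplaceable, but the decisions survive compression
    return Action.KEEP


debugging_chat = ContextItem(refetchable=False, superseded=False, gist_is_enough=True)
print(choose_compaction(debugging_chat))  # → Action.SUMMARIZE
```

Staleness is checked before summarization on purpose, so a stale grep result never gets compressed into a summary of something that stopped being true.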

The Trigger Matters As Much As the Technique

Even the right technique executed at the wrong moment causes damage. Wait until a conversation is nearly out of room before summarizing it, and the summary is built from an already-degraded view: the model is compressing a conversation it's already struggling to reason over cleanly. Run the same operation earlier, well before pressure builds, and the summary comes from a clean, complete picture instead of salvage material.

The same logic holds for the other two techniques. Truncate too early, before content is genuinely stale, and something still in use disappears with no summary to fall back on. Retrieve too late, after the model has already tried to answer from what it half-remembers instead of the real source, and the retrieval fixes an error that already happened instead of preventing it. Category tells you which technique to use. Timing tells you whether that technique does what it's supposed to.

Same System, Three Different Answers

The clearest proof this isn't theoretical is a single coding agent making three different calls inside one session. Claude Code doesn't run one compaction function against everything that gets long. It runs three, for three different problems. Old tool results that are no longer needed, a grep from many turns ago against a file that's since changed, get cleared outright: no summary, no reference kept, because the content is already stale and reconstructing it would mean re-running a command against a file that no longer matches. Large but still-relevant tool output, a long file read, a verbose test run, gets a different treatment: the full content moves to disk, and only a path reference stays in the model's view, pulled back in full if the work touches it again. Conversation history gets a third treatment entirely, because there's no disk copy of a debugging conversation sitting anywhere: when the session nears its limit, the turns themselves get compressed into a structured summary that preserves decisions and task state rather than exact wording.
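The disk-backed pattern for large tool output is the easiest of the three to reproduce in your own agent. The sketch below shows the general idea only and makes no claim about how Claude Code implements it: `offload_if_large`, `reload_full`, and the 2,000-character threshold are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def offload_if_large(output: str, directory: Path, max_chars: int = 2000) -> Tuple[str, Optional[str]]:
    """Return the text to keep in context, plus the path of the full copy if one was written."""
    if len(output) <= max_chars:
        return output, None  # small enough to carry as-is
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write(output)
        path = f.name
    # Only this one-line reference stays in the model's view.
    return f"[tool output, {len(output)} chars, stored at {path}]", path


def reload_full(path: str) -> str:
    """Pull the lossless copy back in when the work touches it again."""
    return Path(path).read_text(encoding="utf-8")
```

A caller would pass each tool result through `offload_if_large` before appending it to the conversation, and call `reload_full` only when a later step needs the exact content. Nothing is summarized, so nothing is paraphrased.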

Three techniques, three categories, inside one product. Cursor draws a similar line somewhere else. Ask it about a codebase and it doesn't truncate the repository or summarize old files to make room, it retrieves. The codebase gets indexed with embeddings ahead of time, and a request pulls back the handful of chunks that are actually relevant through semantic search, the same underlying bet as Claude Code's disk-backed tool outputs: the source is stable and cheaper to fetch again than to carry around compressed.

What Breaks When You Pick Wrong

The failure mode of picking the wrong technique rarely looks like a crash. It looks like a subtly worse answer that's hard to trace back to its cause. Summarize a tool output that needed to be reasoned over exactly, a JSON schema, a diff, an error stack, and the model starts working from its own paraphrase of that structure instead of the structure itself. It will still sound confident. It will just be wrong in ways that are expensive to catch, because the failure surfaces downstream of the compaction, not at the moment it happened.

Truncate something that hadn't actually been superseded yet, and the model loses a fact it still needed, with no summary to fall back on and no signal that anything is missing. This is the quieter failure. There's no error message, just a gap where a constraint used to be, and the next few turns build on an incomplete picture before anyone notices the output is already wrong.

Most teams don't choose the wrong technique through bad judgment. They choose it through convenience. Summarize-everything is the easiest thing to implement, one function call regardless of what's being compressed. Truncate-the-oldest is the second easiest, because it needs no model call at all. Retrieval is the one that gets skipped most often, not because it's technically harder, but because it requires deciding, in advance, that some piece of context deserves to live outside the conversation entirely, addressable rather than carried. That decision is architecture. It has to be made before the system is under pressure, not improvised the moment the window fills up.

Every compaction is a bet that the part being kept matters more than the part being thrown away. Most systems make that bet the same way for everything they discard. The ones that hold up under real use make it three different ways, on purpose, category by category, before they're ever forced to.

Building a Context Budget: A Practical Token Allocation Framework

Siddharth Bhalsod — Mon, 29 Jun 2026 10:10:47 +0000

Open the /context command in Claude Code and you'll see something most teams have never seen for their own product: a precise breakdown of where every token in the session went. System prompt, eight percent. Tools and skills, fourteen percent. Conversation history, sixty-one percent. Free space, seventeen percent. The command exists because someone decided a context window needed a statement, the same way a company needs a P&L.

Most teams building on top of these models have no equivalent. They know their token limit. They have almost no idea what's spending it.

That's not a tooling gap. It's a discipline gap. The teams getting hurt by context rot are rarely the ones with a context window that's too small. They're the ones who never asked where each token was going, or whether it had earned its place there.

This is the practice most teams skip, and the one worth building before you touch a single percentage.

What a Token Budget Actually Is

Most people use "token budget" to mean the ceiling: the advertised window size on the model's spec sheet. That's not a budget. It's a credit limit. A credit limit tells you the most you could spend. It says nothing about where the money is actually going.

A context window is capacity. The budget is the layer above it: the deliberate decision about what fills that capacity, in what proportion, reviewed by someone whose job it is to ask whether each slice still deserves the space. Capacity and governance are different problems, and conflating them is how teams end up with a 1-million-token window and the same quality complaints they had at 50,000.

The reason this matters operationally is that the categories competing for that capacity are in zero-sum competition with each other. Pull ten retrieved documents at 1,500 tokens apiece into a RAG pipeline and you've spent 15,000 tokens before the model has read the actual question. Every one of those tokens came out of the same fixed pool that conversation history, tool output, and your system instructions are also drawing from. Add more of one category and something else gets less. There's no way around that math, only choices about how to make it on purpose instead of by accident.
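That zero-sum math fits in a few lines. The ledger below is a sketch with made-up numbers: only the ten documents at 1,500 tokens come from the example above, while the 50,000-token working budget and the other three figures are assumptions chosen to make the arithmetic visible.

```python
WINDOW_TOKENS = 50_000  # assumed working budget, not any model's advertised ceiling

spend = {
    "system_instructions": 2_000,
    "retrieved_knowledge": 10 * 1_500,  # the ten-document example from the text
    "conversation_history": 20_000,
    "tool_output": 8_000,
}


def ledger(window_tokens: int, spend: dict[str, int]) -> dict[str, float]:
    """Return each category's share of the window, plus what is left unspent."""
    shares = {name: tokens / window_tokens for name, tokens in spend.items()}
    shares["free"] = (window_tokens - sum(spend.values())) / window_tokens
    return shares


for name, share in ledger(WINDOW_TOKENS, spend).items():
    print(f"{name:<22}{share:>6.0%}")
```

With these figures retrieval alone takes 30 percent of the window before the question is read, and only 10 percent is left free. Raising any line means lowering another or eating into that margin.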

The Ceiling Is Not the Budget

A bigger window doesn't fix a budgeting problem. It postpones the moment the problem becomes visible, and it usually makes the eventual mess larger.

When Claude Code moved to a default 1-million-token context window, several practitioners who'd been getting strong results on the 200,000-token version reported the opposite of an upgrade. More room didn't mean better reasoning. It meant more space for stale tool output, abandoned approaches, and old file reads to accumulate without anyone noticing, because nothing forced a cleanup the way a tighter ceiling used to. One developer found his sessions actually improved after deliberately shrinking back down, treating the extra headroom the way you'd treat a hard drive with ten times the space: a place where clutter survives longer, not a reason to stop sorting it.

The practical version of this lesson: window size is RAM, not storage. Treat it like a constrained resource regardless of how large the number on the spec sheet is. A bigger number doesn't earn you the right to be lazier about what loads.

Nobody Owns the Four Categories

Every context window, regardless of product, splits into roughly four things competing for the same fixed space: system instructions, retrieved knowledge, conversation or session history, and tool output. In most teams, every one of those four grows on its own, because no single person is accountable for the total.

System instructions grow by accretion. Someone adds a clarifying line after a bad output, another engineer adds a guardrail after an incident, and six months later the system prompt is a record of every edge case anyone has ever hit, most of which will never recur. Retrieved knowledge grows because "pull a few extra chunks to be safe" feels free and almost never gets measured against whether the extra chunks changed the answer. Conversation history grows because trimming it feels like throwing away memory, even when most of what's being kept is no longer relevant to the current step. Tool output is the worst offender precisely because it's the least visible: a single page snapshot, a pulled list of records, or a raw API response can carry far more raw text than the model needs in that form, and unless something intercepts it, all of it sits in context anyway.

A recent review of agent memory architectures put the underlying issue plainly: the context window is the scarcest shared resource any agent system has, and how it gets allocated is a coordination problem that no single piece of the system can solve by itself. That's the part most teams miss. Budgeting context isn't an engineering task you assign to one function. It's a cross-cutting one, and cross-cutting problems are exactly the ones that go unowned by default.

Budgeting Is a Practice, Not a Percentage

The honest fix here isn't a universal split. Anyone offering "40 percent retrieval, 30 percent history, 20 percent tools, 10 percent system" as a template that fits every product is selling something that won't survive contact with a real codebase or a real support queue. A coding agent's ideal allocation looks nothing like a customer-support bot's, and a fixed template that worked for one will misallocate the other.

What actually transfers across products is the practice, not the numbers. Name the four categories explicitly. Give one person, not a committee, the job of asking whether each category still earns its share. Review it on a cadence, the same way a team reviews a cloud bill line by line rather than just checking they're under the annual cap. A cap tells you nothing about which service is burning the budget. A line-item review does.

Some teams are already doing pieces of this without naming it. Cursor's .cursorignore file is a budgeting decision made before the fact: entire categories of files are told they will never compete for context at all, rather than being added and then managed once they're already taking up space. When Cursor's agent needs to search broadly across a codebase, it can spawn a separate subagent with its own context window just for that search, so raw results never spend tokens out of the main conversation's budget. That's a team deciding, explicitly, that one category of work deserves its own ledger rather than sharing the main one.

Claude Code's /context breakdown is the other half of the same idea: a dashboard that exists specifically so someone can see the split before a session runs long enough to degrade. The dashboard isn't the discipline. Running it before every long session is.

Why This Doesn't Show Up on a Dashboard

The cost of skipping this practice doesn't throw an error. It shows up as a slow, undramatic decline. A support agent starts needing three exchanges where one used to do. A coding agent begins re-deciding things it already decided an hour earlier in the same session. Nobody gets paged, because nothing crashed. The decline gets blamed on the model, because the model is the part of the system anyone can name. The actual cause, an unaudited and unowned allocation of tokens, doesn't show up on any dashboard a team is currently watching.

This is the same root issue that's run through this series in different clothes. Wrong build order showed up as eval infrastructure nobody trusted. Here it shows up as context nobody owns. Both are organizational problems wearing technical costumes. The fix in both cases is the same shape: assign the question to a person, on a schedule, before the system grows large enough that no one can audit it by hand anymore.

A context window doesn't tell you when it's full of the wrong things. It just gets quietly worse and keeps answering anyway.

A token budget you never check isn't a budget. It's just a limit you haven't hit yet.

Context Window Economics: Why Your Token Budget Is a Product Decision

Siddharth Bhalsod — Mon, 22 Jun 2026 06:26:35 +0000

A model advertising a 200,000-token context window can start falling apart at 50,000 tokens. It won't throw an error. It won't flag a warning. It will just get worse, fluently and confidently, on a problem you can only catch by checking its work later.

This isn't a hypothetical. Chroma's 2025 research tested eighteen frontier models, including OpenAI's GPT-4.1, Anthropic's Claude Opus 4, and Google's Gemini 2.5, and found that every one of them degraded as input length grew, often well before the window was anywhere near full. Researchers gave this a name: context rot, a continuous decline in output quality that has nothing to do with hitting a limit.

Most teams don't know this exists. They treat the context window like disk space: bigger is better, and if something feels off, the fix is to feed the model more. A full ticket history instead of a summary. An entire file instead of the relevant function. Three months of chat logs instead of last week's. Each addition feels like it should help. Often it makes things worse, and nothing in the product tells you that it did.

The Capacity Myth
The assumption underneath most AI products is simple: more context means a smarter, more grounded system. Stack the tokens, the model wins.

The research says otherwise. In 2023, Stanford University researchers led by Nelson Liu tested how models handle multi-document question answering and found a U-shaped accuracy curve. Models retrieve information well when it sits at the start or end of the input. When the relevant fact lands in the middle, accuracy drops by more than 30 percent. They called it the lost-in-the-middle effect, and it has since replicated across six model families.

Chroma's broader study identified two more mechanisms compounding the problem. Attention dilution comes from how transformer attention scales: a 100,000-token input creates roughly 10 billion pairwise relationships competing for the model's focus. Distractor interference is sneakier. Content that's topically related but irrelevant doesn't just sit there harmlessly. It actively misleads the model toward the wrong answer.
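The 10 billion figure follows from simple arithmetic, sketched below. This assumes plain full self-attention, counts token pairs for a single attention head in a single layer, and ignores causal masking; the function name is illustrative only.

```python
# Self-attention compares every token with every other token,
# so the number of pairwise scores grows with the square of input length.
def pairwise_scores(num_tokens: int) -> int:
    return num_tokens * num_tokens


for n in (10_000, 50_000, 100_000, 200_000):
    print(f"{n:>7,} tokens -> {pairwise_scores(n):,} pairwise scores")
```

Doubling the input quadruples the number of relationships competing for attention, which is one reason degradation can begin long before the window is full.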

Context window overflow is a hard stop: tokens get truncated, and you find out immediately. Context rot is the opposite. Performance erodes while everything still appears to be working, which is exactly why most teams don't think they have the problem.

None of this requires hitting the window's stated limit. A model with 200,000 tokens of capacity can show measurable degradation by 50,000. The decline is continuous, not a cliff, and that's exactly why it goes unnoticed until someone checks.

The real metric was never capacity. It's signal-to-noise: how much of what's sitting in the window actually deserves the model's attention right now.

The Budget Nobody Is Managing
Once signal-to-noise is the metric, the context window stops looking like storage and starts looking like a budget. Every product built on an LLM allocates a fixed, shared space across competing claims on that attention:

The system prompt and instructions
Retrieved context: documents, files, search results
Conversation history
Tool outputs: API responses, file reads, search results
Working memory the model uses mid-task

A team that decides to retrieve the top 20 documents instead of the top 5, or to keep full chat history instead of summarizing it, is deciding what the model pays attention to instead of something else. That's the same category of decision as choosing which features make a release. Most teams don't experience it that way. They experience it as a retrieval setting somebody configured once and never revisited.
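One way to make that allocation visible is a small ledger that records tokens per category. The sketch below is a minimal illustration, not a feature of any particular framework: the class name, category labels, and token counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    window_tokens: int
    # Token counts per category; the category names mirror the list above.
    usage: dict[str, int] = field(default_factory=dict)

    def record(self, category: str, tokens: int) -> None:
        self.usage[category] = self.usage.get(category, 0) + tokens

    def shares(self) -> dict[str, float]:
        """Fraction of the whole window each category currently occupies."""
        return {name: used / self.window_tokens for name, used in self.usage.items()}

    def free_tokens(self) -> int:
        return self.window_tokens - sum(self.usage.values())


# Hypothetical numbers for a single request.
budget = ContextBudget(window_tokens=200_000)
budget.record("system_prompt", 3_000)
budget.record("retrieved_context", 40_000)
budget.record("conversation_history", 25_000)
budget.record("tool_outputs", 12_000)

for name, share in sorted(budget.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:<22}{share:6.1%}")
print("free tokens:", budget.free_tokens())
```

Logging these shares per request turns "a retrieval setting somebody configured once" into a number someone can review.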

This is also where the five-dimension framework from earlier in this series gets concrete. Context budget lives inside the interaction model dimension: not whether a product feels conversational, but whether what actually reaches the model at each turn is the right input for the job. A system can score well on every other dimension and still fail here, because nobody assigned ownership of what goes into the window.

The size of the bill is the easiest part of this to track, and the part most teams actually do track. API pricing runs largely per token, so a retrieval step that grabs 30 documents instead of 8 isn't more thorough; it's a standing tax on every call the product makes. At a million queries a month, the gap between a tight context and a bloated one shows up on the margin line, not the engineering backlog.
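A rough back-of-the-envelope version of that gap is sketched below. The inputs are assumptions chosen purely for illustration: 800 tokens per retrieved document and a price of 3 USD per million input tokens are hypothetical, not any provider's actual rate.

```python
def monthly_retrieval_cost(
    docs_per_query: int,
    tokens_per_doc: int,
    queries_per_month: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Input-token cost of the retrieved documents alone, in USD per month."""
    tokens = docs_per_query * tokens_per_doc * queries_per_month
    return tokens / 1_000_000 * usd_per_million_input_tokens


# Assumed values, for illustration only.
TOKENS_PER_DOC = 800
QUERIES_PER_MONTH = 1_000_000
PRICE = 3.0

tight = monthly_retrieval_cost(8, TOKENS_PER_DOC, QUERIES_PER_MONTH, PRICE)
bloated = monthly_retrieval_cost(30, TOKENS_PER_DOC, QUERIES_PER_MONTH, PRICE)
print(f"8 docs:  ${tight:,.0f}/month")
print(f"30 docs: ${bloated:,.0f}/month")
print(f"gap:     ${bloated - tight:,.0f}/month")
```

Under these assumptions the difference is tens of thousands of dollars a month before any quality effect is counted. Swap in real prices and document sizes to get a number worth putting in front of whoever owns the margin.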

The Silent Failure, Again
We've made a version of this argument before in this series: eval failures are usually sequencing problems, not metric-selection problems. Skip the detection layer, and a system degrades invisibly until a user notices. Context bloat fails the same way, one layer over, and it needs the same kind of production monitoring this series has already argued for, just watching a different signal.

No error fires when context degrades. The response still looks plausible. It's fluent, it's confident, and it's often wrong in a way only the user catches, because the system has no idea its own attention has been diluted.

Picture a support agent fed a customer's full ticket history instead of a structured summary. Forty messages in, it recommends a fix it already tried and the customer already rejected. The model didn't get a worse weight update overnight. Its context just got noisier, and nobody was watching for it.

How AI Native Products Actually Manage the Budget
Cursor and Claude Code make a useful contrast, because both face the worst version of this problem: real codebases generate far more text than any context window can comfortably hold.

Claude Code's answer is auto-compaction. As a session approaches its limit, it summarizes the older exchanges and replaces the raw history with a condensed record rather than forcing a hard reset. Practitioners tracking this closely have found that compacting earlier, around 60 to 75 percent of capacity rather than waiting for the automatic trigger near 95 percent, produces longer and higher-quality sessions. One developer monitoring usage independently found Claude Code reporting only 10 percent of capacity left while his own tracking showed 64 percent had actually been used, which leaves 36 percent free by his count: a 26-point gap traced back to deliberately conservative compaction thresholds. The unused space wasn't waste. It was headroom protecting signal-to-noise as the session went on.
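The general shape of threshold-based compaction can be sketched in a few lines. This is not Claude Code's implementation: the function, its parameters, the 70 percent trigger, and the whitespace tokenizer and summarizer used in the demo are all hypothetical stand-ins.

```python
from typing import Callable


def maybe_compact(
    messages: list[str],
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    window_tokens: int,
    compact_at: float = 0.70,  # assumed early trigger, inside the 60-75% range
    keep_recent: int = 4,      # assumed number of recent messages kept verbatim
) -> list[str]:
    """Replace older messages with one summary once usage crosses the trigger."""
    used = sum(count_tokens(m) for m in messages)
    if used < compact_at * window_tokens or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent


# Stand-ins: a whitespace "tokenizer" and a placeholder summarizer.
def count_words(text: str) -> int:
    return len(text.split())


def fake_summary(older: list[str]) -> str:
    return f"[summary of {len(older)} earlier messages]"


history = [f"turn {i}: " + "word " * 10 for i in range(10)]  # 12 words each
compacted = maybe_compact(history, count_words, fake_summary, window_tokens=150)
print(len(history), "->", len(compacted), "messages")
```

The design choice worth copying is the explicit `compact_at` parameter: the trigger becomes a setting someone owns rather than a side effect of hitting the limit.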

Cursor takes a different angle on the same problem. Instead of expanding the window to fit a growing codebase, it indexes the codebase into a vector database and retrieves only the chunks semantically relevant to the current query. A natural-language question pulls back a handful of relevant files instead of the entire repository, assembled into context just for that request. The model never sees more than it needs, by design, not by accident.
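The retrieval idea itself is small enough to sketch with the standard library. This is an illustration of top-k similarity search in general, not Cursor's code: the file paths are hypothetical, and the three-dimensional vectors are hand-written stand-ins for real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k paths whose vectors are most similar to the query vector."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [path for path, _ in ranked[:k]]


# Toy index: hypothetical files with made-up vectors.
index = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "auth/session.py": [0.8, 0.3, 0.1],
    "billing/invoice.py": [0.1, 0.9, 0.2],
    "README.md": [0.3, 0.3, 0.3],
}
query = [1.0, 0.2, 0.0]  # stand-in for an embedded question about login
print(top_k(query, index, k=2))
```

Only the top-k files are assembled into the request; everything else in the repository never competes for the window.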

Neither product asks how much it can fit. Both ask what deserves to be there. That's the actual skill underneath context window economics, and it has nothing to do with how large the advertised window is.

Why This Belongs on the Roadmap
There's a trust cost to context rot that's quieter than the API bill, and harder to track. Users rarely file a bug report that says the model's context got noisy around message forty. They just notice the product got worse over a long session, stop trusting it with anything that matters, and churn without explaining why.

The model didn't get dumber. Its context got noisier.

Ownership matters here too. If the context budget isn't anyone's explicit job, the way conversion rate or onboarding flow is somebody's job, it drifts by default toward "more," because more feels safer than a decision someone has to defend. The instinct to fix a struggling AI feature by feeding it more is the same instinct as fixing it by upgrading to a bigger model. Both substitute scale for design. In most cases, the actual fix is a smaller, better-curated context, and that's a decision that belongs with whoever owns the product, not buried in a retrieval config nobody revisits after launch.

Every additional token placed in front of a model is a decision about what it should attend to instead of something else. Treat it with the discipline of a feature cut or a pricing tier, not a default setting left over from launch week. The teams getting this right don't have the biggest context windows. They decided, on purpose, what doesn't get to be there.

Building a Production-Grade LLM Eval System From Scratch

Siddharth Bhalsod — Tue, 16 Jun 2026 10:08:13 +0000

Your LLM Eval Is Broken Before You Write the First Test

Most teams discover their eval system is broken the same way. They ship a prompt change that improves tone but silently tanks accuracy on edge cases. They upgrade their model version and something subtly changes — response length, citation patterns, how it handles ambiguity. Nobody catches it because the test suite was checking the wrong things. Or it wasn't running in CI. Or it existed on someone's laptop and that person has since left.

This is not a metrics problem. It is a sequencing problem.

The teams with working eval infrastructure — the ones where a prompt change doesn't become a post-mortem — built their system in a specific order. They defined what good looks like before they wrote a single test. They instrumented the system before they had enough data to justify it. They treated evaluation as architecture, not as a final validation step bolted on before launch.

In the AI Native series, Article 3 established that most teams build the wrong stack because they start with the model and work backward. The same mistake compounds inside the eval layer: most teams start with a framework and work backward. They install DeepEval or Braintrust, run a quick hallucination check, ship it, and call the eval layer done. The framework is not the system. The framework is one component inside a system that has to be deliberately designed.

This article is the design guide for that system. Not a framework tutorial — a sequencing blueprint.

The Wrong Starting Point

When a team decides to "add evals," the first thing they typically reach for is a library. pip install deepeval. Add AnswerRelevancyMetric. Run it against a few test cases. Green outputs feel like progress.

They are not progress. They are the illusion of instrumentation.

The problem is that answer relevancy is a generic metric. It tells you whether the model's response is topically related to the query — which is almost always true for any reasonably sized model and any reasonably coherent prompt. Passing this metric by default is like testing whether your e-commerce site can render a product page and calling the checkout flow validated.

The real question is not "does this output look relevant?" The real question is: what does quality actually mean for this specific system, in this specific product context, for this specific user?

That question is not a technical question. It is a product question. And it has to be answered before any eval framework is touched.

Layer One: Define Quality Before You Measure It

Consider two products that both use retrieval-augmented generation. The first is a legal research tool — lawyers use it to find case precedents before drafting filings. The second is a customer support assistant — customers use it to resolve billing disputes without calling in.

Both systems retrieve documents. Both generate responses. Both could fail on hallucination and answer relevancy. But the quality definitions are completely different.

For the legal tool, the most dangerous failure is a confident answer that cites a real case incorrectly — a paraphrase that changes the meaning of a ruling. For the support tool, the most dangerous failure is a refusal to resolve something the system should be able to handle — a hedge that sends the customer to a human unnecessarily.

Run the same generic metric set on both and you will get a score. That score will mean nothing to either product team.

This is why quality definition is Layer 1. Not Layer 4. Not "something we add later when we have real data."

The way to do it: write three to five failure statements before you write any test. Not metric names — failure statements. Things like: "The system confidently states a legal precedent that does not exist," or "The system routes a resolvable billing dispute to a human agent." These statements describe what broken looks like in terms your product team and your eval framework can both understand.

Then map each failure statement to a metric type. Some will map to built-in DeepEval metrics. Some will require a custom GEval criterion. Some will require a deterministic code-based check. The mapping is the architecture decision.
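A minimal sketch of what that mapping can look like as data, before any framework is involved. The two statements come from the examples above; which metric type each one maps to is an assumption made here for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MetricType(Enum):
    BUILT_IN = "built-in metric"
    CUSTOM_CRITERION = "custom LLM-judged criterion"
    CODE_CHECK = "deterministic code check"


@dataclass(frozen=True)
class FailureStatement:
    statement: str
    metric_type: MetricType


# The mapping below is an illustrative guess, not a recommendation.
FAILURE_MAP = [
    FailureStatement(
        "The system confidently states a legal precedent that does not exist.",
        MetricType.CUSTOM_CRITERION,
    ),
    FailureStatement(
        "The system routes a resolvable billing dispute to a human agent.",
        MetricType.CODE_CHECK,
    ),
]


def group_by_metric_type(statements: list[FailureStatement]) -> dict[MetricType, list[str]]:
    """Show which kind of check each failure statement needs."""
    grouped: dict[MetricType, list[str]] = {t: [] for t in MetricType}
    for item in statements:
        grouped[item.metric_type].append(item.statement)
    return grouped


for metric_type, statements in group_by_metric_type(FAILURE_MAP).items():
    print(f"{metric_type.value}: {len(statements)} statement(s)")
```

Keeping the mapping as reviewable data makes the architecture decision explicit: a product owner can read it without reading the test code.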

Layer Two: Instrument Without Waiting for Data

The second failure mode: teams wait until they have enough production data to build a "real" test suite. This feels responsible. It is actually how you end up with no eval coverage during the months when the system is most likely to change.

The practical answer is synthetic goldens.

DeepEval's Synthesizer can generate test cases from your knowledge base before a single real user has touched the system. If you are building a RAG pipeline, you feed it your document corpus and it generates realistic input/output pairs — questions a real user might ask, grounded in the content the system will retrieve. These are not perfect proxies for real traffic. They are good enough to establish a baseline and to catch the class of failures that break obviously.

GitHub's Copilot team runs comprehensive offline evaluations against every model before it reaches production — testing across metrics like latency, accuracy, and response consistency before any user interaction. They do not wait for user feedback to tell them the model regressed. The eval system surfaces regressions in the same pipeline that builds the release.

The minimum viable starting point is not fifty production examples. It is twenty-five synthetic goldens, two to three metrics that map to your failure statements, and a passing threshold. That is a real eval system. Run it before every prompt change, every model swap, every retrieval parameter update.
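The gating logic behind "a passing threshold" can be sketched without any eval library. Everything here is an assumption for illustration: the function name, the 0.8 per-case threshold, the 90 percent required pass rate, and the scores themselves, which in practice would come from whichever metric you run.

```python
def passes_gate(scores: list[float], threshold: float, min_pass_rate: float = 0.9) -> bool:
    """True when enough golden test cases score at or above the threshold."""
    if not scores:
        return False  # no goldens means no evidence, so fail closed
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores) >= min_pass_rate


# Hypothetical scores for 25 synthetic goldens on one metric.
baseline_scores = [0.9] * 23 + [0.5] * 2   # 23 of 25 pass
regressed_scores = [0.9] * 22 + [0.5] * 3  # 22 of 25 pass

print("baseline passes:", passes_gate(baseline_scores, threshold=0.8))
print("regressed passes:", passes_gate(regressed_scores, threshold=0.8))
```

Run as a CI step, a gate like this blocks a prompt change, model swap, or retrieval parameter update that pushes the pass rate below the agreed line.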

Layer Three: Structure the Test Suite Around Failure Modes, Not Features

This is the architectural distinction most teams miss.

The natural instinct is to organize test cases around features: here are the tests for the summarization flow, here are the tests for the question-answering flow, here are the tests for the refusal behavior. This organization feels logical. It mirrors how the product is structured.

The problem is that eval systems organized by feature tell you what broke but not why. When the summarization score drops three points, you know summaries got worse. You do not know whether the retrieval layer is returning worse context, whether the prompt changed something in formatting behavior, or whether a model update shifted the generation style.

Structure the test suite around failure modes instead. Each failure statement from Layer 1 becomes a test class. Each test class runs its specific metric. When a test class fails, the failure message is already diagnostic — it points to the component and the behavior, not just the feature.

In DeepEval, this looks like:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Custom criterion for one failure statement: confident, unsupported precedents.
confident_hallucination = GEval(
    name="ConfidentHallucination",
    criteria="The output should never state a legal precedent with high confidence unless the retrieved context directly supports it.",
    # The judge sees only the model's answer and the retrieved documents.
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    threshold=0.8,
)
```

This is not the generic HallucinationMetric. It is a custom GEval criterion written in plain English, tied to the specific failure mode the legal research team identified. When it fires, it fires on a specific category of error — not on a score that requires interpretation.

DeepEval's recommendation, grounded in production experience, is to limit yourself to five metrics maximum: two to three generic system-specific metrics (contextual precision for a RAG pipeline, tool correctness for an agent) and one to two custom, use-case-specific metrics. The constraint is intentional. More metrics means noisier signals and harder-to-diagnose failures.

Layer Four: The Drift Problem No Test Suite Catches

There is a class of failure that well-designed test suites miss almost entirely. Call it version drift.

A model provider pushes a silent update. Not a new model version — an update to the weights behind the same model string. Your evals pass. Your prompts are unchanged. And quietly, over the following two weeks, something shifts. Users start submitting more corrections. Satisfaction scores drift down by a few points. The output that used to be crisp and structured gets slightly looser. Nobody changed anything. But the system got worse.

This is the failure mode that unit testing, however well structured, cannot catch. Offline evals run against a snapshot. They tell you whether the system performs acceptably on the dataset you created. They cannot tell you whether the live system is drifting from that baseline in production.

The answer is production monitoring — which is Layer 4 of the eval system and the layer most teams skip entirely.

Production monitoring means scoring a sample of real user interactions continuously using referenceless metrics. Referenceless because you will not have ground truth labels for live traffic. DeepEval provides these: metrics like AnswerRelevancyMetric and FaithfulnessMetric run without a known correct answer, and a custom GEval criterion can cover qualities such as conciseness.

The setup is straightforward: route ten to twenty percent of live traffic through your eval pipeline, aggregate scores on a rolling window, and alert when scores cross a threshold. Confident AI — the cloud platform built on top of DeepEval — handles the dashboard and monitoring infrastructure if you do not want to build it yourself. The point is not the tool. The point is that offline evals and production monitoring are two different systems solving two different problems, and you need both.
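The sample, aggregate, and alert loop can be sketched with the standard library alone. This is a hypothetical illustration, not Confident AI's or DeepEval's monitoring API: the class name, the 15 percent sample rate, the window size, and the alert line are all assumed values.

```python
import random
from collections import deque
from typing import Optional


class DriftMonitor:
    def __init__(self, sample_rate: float = 0.15, window: int = 200,
                 alert_below: float = 0.75, seed: Optional[int] = None) -> None:
        self.sample_rate = sample_rate
        self.alert_below = alert_below
        self.scores: deque = deque(maxlen=window)  # rolling window of scores
        self._rng = random.Random(seed)

    def should_sample(self) -> bool:
        """Decide whether this live request is routed through the eval pipeline."""
        return self._rng.random() < self.sample_rate

    def record(self, score: float) -> bool:
        """Add a score; return True when the rolling mean falls below the alert line."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.alert_below


# Tiny window so the behavior is visible; scores are made up.
monitor = DriftMonitor(window=5, alert_below=0.75)
stream = [0.9, 0.9, 0.8, 0.8, 0.8, 0.6, 0.6]
alerts = [monitor.record(score) for score in stream]
print(alerts)
```

Alerting on the rolling mean rather than on single scores is deliberate: drift shows up as a sustained shift, and one bad response should not page anyone.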

Teams that run only offline evals are flying blind during the longest part of a product's life: after launch.

The Build Order

The failure modes are not random. They follow directly from building the eval system in the wrong order.

Teams that instrument too late — after the system is in production — start with generic metrics and work backward to product meaning. They are always trying to retrofit quality definitions onto scores they do not fully trust.

Teams that organize by feature instead of failure mode always have a two-step debugging process: find the failing test, then figure out what the failing test actually means.

Teams that skip production monitoring ship a system that degrades invisibly until users tell them it has.

The right order is four layers, built in sequence:

Define quality as failure statements, before touching any framework.
Generate synthetic goldens and establish baselines, before waiting for real data.
Structure test classes around failure modes, not features.
Add production monitoring for drift, as a separate system from the offline test suite.

This is not how most teams build their eval layer. Most teams build Layer 2 first — the framework, the test cases, the CI run — and never get to Layers 1 and 4 at all.

The eval system that degrades invisibly is not a testing failure. It is a sequencing failure.

Most teams building AI products start with the model. That is the mistake. AI Native infrastructure has five layers, and none of them is the model - Read👇

Siddharth Bhalsod — Wed, 10 Jun 2026 08:41:36 +0000

What AI Native Infrastructure Looks Like in Practice

Siddharth Bhalsod — Wed, 10 Jun 2026 08:37:45 +0000

Most teams get the model right on the first try.

They pick Claude or GPT-4o, wire up a few prompts, and ship something that impresses in a demo. Then they spend the next six months wondering why responses drift, costs compound faster than users do, and the system that felt clever in week two feels brittle by month four.

The model was not the problem. The model was never the problem.

The mistake is architectural, and it almost always starts the same way: teams choose the model before they design the data layer. Everything downstream from that sequencing error will cost them.

The Wrong Starting Point

Here is how most teams actually build: they pick a foundation model, write prompts that work for their test cases, and then figure out how to feed the model the right information. The data layer is treated as a support function for the model. A retrieval step to bolt on. Something to sort out later.

This is backwards.

In AI Native systems, the data layer is not a supporting actor. It is the foundation that determines whether the model can do anything useful at all. A well-prompted model operating on stale, poorly structured, or imprecisely retrieved data will underperform a weaker model operating on clean, fresh, semantically precise context. Every time.

What AI Native infrastructure actually looks like is five distinct layers, each with a specific job, and each dependent on the one below it. Start from the bottom, not the top.

Layer One: The Embedding Store

Before any user query fires, before any retrieval logic runs, data has to be prepared. Raw documents, knowledge bases, product catalogs, customer history — whatever domain knowledge the system needs — must be converted into vector embeddings and stored in a way that allows fast, semantically relevant retrieval.

This is the embedding store, and the choices made here reverberate through the entire system.

The first real decision is managed versus self-hosted. Pinecone is the category-default managed option: operationally simple, scales without tuning, and handles multi-region distribution natively. For teams that want full control over their infrastructure without a managed service dependency, Qdrant, built in Rust, is among the lower-latency open-source vector databases and handles complex metadata filtering cleanly. Weaviate sits in between: open-source, self-hostable, with native hybrid search that combines semantic and keyword retrieval without external tooling.

For teams already running Postgres, pgvector is worth a serious look before adding a dedicated vector database. Production-grade since the 0.7 release, it handles up to roughly 50 million vectors on a well-provisioned instance. The operational savings of not running a separate system are real, and the retrieval quality is equivalent to purpose-built options at that scale.

The second decision, less discussed and more consequential, is chunking strategy. How documents are split before embedding determines what the model can actually retrieve. Fixed-size chunks with no attention to semantic boundaries produce retrieval that regularly cuts a sentence in half, drops the precise clause that answers the query, or returns paragraphs that contain the right word in the wrong context. Semantic chunking — splitting on paragraph breaks, section boundaries, or structural signals within the document — consistently outperforms fixed-size approaches. It adds complexity upfront. It is worth it.
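The difference is easy to see on a toy document. The sketch below compares naive fixed-size splitting with a simple paragraph-packing chunker; it is one basic form of boundary-aware chunking, the function names and the sample text are hypothetical, and production systems typically add overlap and token-based limits.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split every `size` characters, regardless of sentence or paragraph boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate  # an oversized single paragraph is kept whole
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Refunds are issued within 14 days.\n\n"
    "Cancellation takes effect at the end of the billing period.\n\n"
    "Contact support to delete an account."
)

print(fixed_size_chunks(doc, 40)[0])   # ends partway through the second paragraph
print(paragraph_chunks(doc, 80))       # each chunk is a complete paragraph
```

The fixed-size version hands the embedding model a fragment that straddles two topics; the paragraph version keeps each retrievable unit semantically whole.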

A third decision compounds: whether to use dense retrieval only (pure vector similarity) or hybrid retrieval that combines vector search with keyword matching. For domain-specific vocabularies — product codes, technical terms, proper nouns — pure semantic search regularly misses exact-match queries. Qdrant and Weaviate both offer hybrid retrieval that fuses dense and sparse scores without external tooling. For most production systems serving real users on real content, this is the right default.
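One common way to combine the two result lists is reciprocal rank fusion, sketched below. Note the swap: this fuses ranks rather than raw scores, which avoids having to normalize dense and sparse scores onto one scale. It is a generic illustration, not the internal implementation of any specific database; the document IDs are hypothetical and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query that mentions a product code.
dense_results = ["doc_returns_policy", "doc_shipping", "doc_sku_4471"]
sparse_results = ["doc_sku_4471", "doc_returns_policy"]

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

The exact-match document that dense search ranked last moves ahead of the merely related one, because the keyword list vouches for it.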

Layer Two: The Retrieval Pipeline

The embedding store holds the vectors. The retrieval pipeline is the logic that decides which ones to surface, in what order, and in what form.

Most teams treat retrieval as a single step. Query comes in, nearest neighbors come back, those chunks go into the prompt. This works well enough in demos. In production, with real query distributions and real document variance, it degrades predictably.

Production retrieval pipelines have three stages:

Query transformation happens before the vector search. The user's literal input is rarely the best query to run against the embedding store. A user asking "how do I cancel?" might be best served by retrieving chunks about cancellation policy, refund terms, and account deletion procedures simultaneously. Rewriting the query, expanding it into multiple sub-queries, or using the conversation history to disambiguate intent before retrieval is the difference between a system that retrieves what the user typed and one that retrieves what the user meant.

Retrieval and re-ranking is the search step itself, followed by a second pass that re-scores the top candidates for relevance before passing them to the model. Bi-encoder models (the ones that power standard vector search) optimize for broad recall. Cross-encoder re-rankers optimize for precision among the top results. Running both — retrieve broadly with the bi-encoder, then re-rank the top 20 results with a cross-encoder before selecting the final context — produces meaningfully better retrieval quality than either approach alone, at a latency cost that is usually under 50 milliseconds.

Context assembly is the final step before the prompt. Which chunks to include, in what order, how to handle redundancy across chunks, whether to add metadata like document date or source type: these decisions shape what the model sees. Models perform better when the most relevant context appears at the beginning or end of the context window, not buried in the middle. Position matters more than engineers typically expect.
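The second and third stages can be sketched end to end with stand-in scorers. The assumptions are loud ones: word overlap stands in for a bi-encoder's similarity, Jaccard similarity stands in for a cross-encoder, word counts stand in for tokens, and the chunks and file names are hypothetical. Real systems would use neural models for both scoring passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str


def words(text: str) -> set[str]:
    return set(text.lower().split())


def recall_score(query: str, chunk: Chunk) -> float:
    """Stand-in for bi-encoder similarity: count of shared words (broad recall)."""
    return len(words(query) & words(chunk.text))


def precision_score(query: str, chunk: Chunk) -> float:
    """Stand-in for a cross-encoder: Jaccard similarity (penalizes loose matches)."""
    q, d = words(query), words(chunk.text)
    return len(q & d) / len(q | d)


def retrieve_then_rerank(query: str, corpus: list[Chunk],
                         recall_k: int = 20, final_k: int = 3) -> list[Chunk]:
    candidates = sorted(corpus, key=lambda c: recall_score(query, c), reverse=True)[:recall_k]
    return sorted(candidates, key=lambda c: precision_score(query, c), reverse=True)[:final_k]


def assemble_context(ranked: list[Chunk], max_words: int) -> str:
    """Keep best-first order, drop near-verbatim duplicates, respect a size budget."""
    seen: set[str] = set()
    parts: list[str] = []
    used = 0
    for chunk in ranked:
        key = " ".join(chunk.text.lower().split())
        cost = len(chunk.text.split())
        if key in seen or used + cost > max_words:
            continue
        seen.add(key)
        parts.append(f"[source: {chunk.source}]\n{chunk.text}")
        used += cost
    return "\n\n".join(parts)


corpus = [
    Chunk("how annual plans work including refund policy billing terms "
          "and how to cancel a subscription", "billing/overview.md"),
    Chunk("cancel a subscription", "help/cancel.md"),
    Chunk("Cancel a subscription", "faq.md"),
    Chunk("cancel an order before it ships", "help/orders.md"),
    Chunk("shipping times for international orders", "help/shipping.md"),
]
query = "how do i cancel my subscription"

ranked = retrieve_then_rerank(query, corpus, recall_k=4, final_k=3)
context = assemble_context(ranked, max_words=20)
print(context)
```

The first pass favors the long overview chunk because it shares the most words; the second pass promotes the short, on-topic chunk to the top, and assembly drops the duplicate before anything reaches the prompt.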

Layer Three: Context Management

This is where most teams discover that they had an implicit assumption they never examined.

They assumed context would stay small enough to not matter.

Context management is the layer that tracks what the model needs to know within a session, across sessions, and at the system level, and makes deliberate choices about what to include, what to compress, and what to discard. It sounds simple. In practice, it is the layer that silently determines whether the system feels coherent or amnesiac, expensive or cost-efficient.

The clearest failure mode is context stuffing: including everything the system might need, in full, on every request, because it is easier than deciding what to exclude. At low traffic volumes this is fine. At scale, the token cost compounds fast, latency climbs as the context window fills, and the model's attention degrades on long-context inputs. An enterprise application routing ten thousand requests per hour through a 128K context window, when 60K of that context is the same static background information repeated verbatim on every call, is not a data architecture problem — it is an engineering decision that has simply not been made yet.
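The scale of that example is worth writing out. The figures below are the ones from the paragraph above, with one assumption: "128K" and "60K" are treated as 128,000 and 60,000 tokens.

```python
requests_per_hour = 10_000
static_tokens_per_request = 60_000   # identical background context on every call
window_tokens = 128_000              # assuming "128K" means 128,000 tokens

repeated_tokens_per_hour = requests_per_hour * static_tokens_per_request
repeated_tokens_per_day = repeated_tokens_per_hour * 24
static_share_of_window = static_tokens_per_request / window_tokens

print(f"{repeated_tokens_per_hour:,} repeated tokens per hour")
print(f"{repeated_tokens_per_day:,} repeated tokens per day")
print(f"{static_share_of_window:.1%} of every window is the same static text")
```

That is 600 million repeated tokens an hour, with nearly half of every window spent before the request-specific context is even added.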

Effective context management has three components. A session layer tracks the immediate conversation and recent user actions, kept compact, summarized aggressively after the first few turns rather than appended indefinitely. A memory layer handles what the system should retain across sessions — user preferences, prior decisions, domain-specific facts about this user's context — stored as structured records, not as raw conversation history. And a system layer manages the baseline context that every request needs: the product's core knowledge, current configuration, and any real-time state the model should be aware of.

The goal is not minimalism for its own sake. It is precision. The right context, fresh, in the right position, without padding.

Layer Four: The Eval Framework

Everything built so far produces outputs that cannot be tested with a passing or failing unit test. The model might return a factually correct response in the wrong format. It might answer the literal question while missing the user's actual intent. It might perform well on the examples in your test suite and drift on the long tail of real queries that you have not seen yet.

Eval infrastructure is what makes AI Native systems improvable, rather than just deployed.

The production pattern that engineering teams are converging on in 2026 uses two tools with a clear division of labor. A lightweight open-source framework handles CI/CD gating at the PR level: DeepEval is the closest thing the LLM eval world has to pytest, running assertion-style tests against model outputs on every code change. RAGAS handles retrieval-specific metrics — context precision, answer faithfulness, answer relevance — for RAG-heavy systems. These run in the pipeline, automatically, before any change ships.

A second tier handles production monitoring and regression tracking: Braintrust for dataset-first prompt regression workflows with human annotation, or Arize Phoenix for teams that need production observability alongside evaluation. The two tiers run together. Unit-level evals catch regressions before deployment. Production evals catch drift after it.

The discipline that separates teams who use evals from teams who have eval infrastructure is this: the metrics are defined before the system is built, not after. What does "correct" mean for this use case? What does "faithful" mean? What does "hallucinated" mean, specifically, for this domain? These are design questions, not measurement questions. Teams that get this right start their architecture work at the eval layer. Teams that get this wrong discover they cannot measure progress at the point when it matters most.

Layer Five: The Gateway

The LLM gateway is the layer that most teams add last. It should be among the first decisions made.

A gateway sits between your application and every model provider. It handles routing, cost controls, caching, failover, and observability — functions that are not optional at any meaningful production scale, but that most teams implement as ad hoc logic scattered across application code until a provider outage or a cost spike forces the issue.

At scale, the case is not theoretical. Teams running production AI workloads that skip this layer see token spend compound 30 to 40 percent faster than necessary from redundant identical requests that a semantic cache would have served without an inference call. They carry outsized operational risk during provider outages that proper failover configuration would absorb. They cannot attribute costs to teams or features because there is no central point of control.

Bifrost, an open-source gateway built in Go, handles 5,000 requests per second at 11 microseconds of overhead — low enough that it adds no perceptible latency to the inference call. LiteLLM is the most widely deployed open-source option for teams that want a Python-native solution with broad provider coverage. Cloudflare AI Gateway is the lowest-friction managed option for teams that want zero infrastructure maintenance. Kong AI Gateway integrates into existing API management infrastructure for enterprise environments already running Kong.

The right choice between them matters less than the decision to have one. Without a gateway, every team inevitably rebuilds fragments of it at the application layer: manual retry logic, cost tracking spreadsheets, per-feature model selection buried in function calls. The gateway consolidates that logic into a single, auditable layer. When a provider goes down at 2am, the failover runs automatically. When a new model releases and you want to test it on five percent of traffic, you change one configuration line.
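The responsibilities described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not how Bifrost, LiteLLM, or any other gateway is implemented: the Gateway class and its parameters are hypothetical, providers are plain callables, and the cache is exact-match on the prompt rather than the semantic cache mentioned earlier.

```python
import hashlib
import random
from typing import Callable, Dict, List, Optional, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, function from prompt to completion)


class Gateway:
    def __init__(self, providers: List[Provider],
                 canary: Optional[Provider] = None,
                 canary_share: float = 0.0,
                 rng: Optional[random.Random] = None) -> None:
        self.providers = providers        # ordered: primary first, fallbacks after
        self.canary = canary              # optional new model under test
        self.canary_share = canary_share  # e.g. 0.05 for five percent of traffic
        self.rng = rng or random.Random()
        self.cache: Dict[str, str] = {}
        self.calls: Dict[str, int] = {}   # inference calls per provider, for attribution

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:             # identical request: no inference call
            return self.cache[key]

        route = list(self.providers)
        if self.canary is not None and self.rng.random() < self.canary_share:
            route.insert(0, self.canary)  # try the canary first, fall back as usual

        last_error: Optional[Exception] = None
        for name, call in route:
            try:
                result = call(prompt)
            except Exception as exc:      # provider failure: move to the next one
                last_error = exc
                continue
            self.calls[name] = self.calls.get(name, 0) + 1
            self.cache[key] = result
            return result
        raise RuntimeError("all providers failed") from last_error


# Illustrative stand-ins for real provider clients.
def primary_down(prompt: str) -> str:
    raise TimeoutError("primary provider unavailable")


def backup_ok(prompt: str) -> str:
    return "backup:" + prompt
```

With Gateway([("primary", primary_down), ("backup", backup_ok)]), a request fails over to the backup without the caller noticing, and a repeated identical prompt is served from the cache. Testing a new model on a slice of traffic is the canary_share value, which is the "one configuration line" in this sketch.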

The Right Build Order

The mistake is not in the individual layer choices. Most teams are thoughtful about which embedding store they pick, which eval framework they try. The mistake is in the order.

Teams that start with the model end up retrofitting the infrastructure around a system that was already making assumptions about what the data layer would eventually provide. The embedding store gets added to support a retrieval pattern that the prompt design has already locked in. The eval framework gets added when the system is already live and there is no baseline to regress against. The gateway gets added when the first cost spike arrives.

Teams that start with the data layer make different decisions. They define what "good retrieval" means before they write a prompt. They choose their embedding store based on the query patterns their system will actually need to support. They design the context management strategy before they know how often it will need to run.

The model sits at the top of this stack, not the bottom. It is the most visible layer. It is the layer that produces the output the user sees. But it is the last thing to configure, not the first.

Starting with the model is like designing a building by choosing the facade material before you know the load-bearing structure. The facade is what people will look at. The structure is what holds it up.

Build the structure first.

Most companies calling themselves AI Native are just fast. Strip the AI out of their product. If what remains is a slower version of the same thing, they never were AI Native. They were AI-augmented. Different architecture. Different ceiling. Read👇

Siddharth Bhalsod — Thu, 04 Jun 2026 05:50:59 +0000

Is Your System Actually AI Native? A 5-Dimension Scorecard

Most teams think they are AI Native in some areas and not others. They are right. But they are wrong about which areas. A CTO told me his platform was fully AI Native. I asked five questions across five dimensions. Read👇

Siddharth Bhalsod — Thu, 04 Jun 2026 05:48:57 +0000

What Is AI Native? The One Test That Separates Real from Fake in 2026

Is Your System Actually AI Native? A 5-Dimension Scorecard

Siddharth Bhalsod — Thu, 04 Jun 2026 05:47:15 +0000

Last month, a CTO told me his platform was "fully AI Native." I asked him five questions. By the third one, he stopped calling it that.

This is not a criticism of that CTO. His team built something impressive. They had a recommendation engine powered by GPT-4o, a natural language search bar, and an AI-generated insights dashboard. Real features, real value. But when I asked what happens when you swap out the AI model for a rules engine, the answer was: the product gets worse, but it still works. Every screen still loads. Every workflow still completes. The AI made things faster and smarter. It did not make things possible.

That is the line this article is about. Not the philosophical one from the first piece in this series, where we established the "remove the AI" test. This is the operational version. Five specific dimensions you can score your system against, today, to know whether you are genuinely AI Native or AI Augmented with good marketing.

(Continuing from: What Is AI Native? The One Test That Separates Real from Fake in 2026)

Why a Single Test Is Not Enough

The "remove the AI" thought experiment is useful as a gut check. It creates instant clarity. But it fails as a diagnostic tool for one reason: it treats AI Nativeness as binary when the architecture underneath is multi-dimensional.

A product can have an AI Native interaction model but an AI Augmented data layer. It can have an intelligence-first data architecture but a traditional team structure that bottlenecks every model update through a centralized ML team. These mismatches are where most companies actually live, and they are invisible to a single yes-or-no test.

The scorecard that follows is not theoretical. It comes from patterns visible across the companies building in this space right now, from how Cursor structures its editor around agent-first workflows to how Perplexity's entire data pipeline assumes AI will consume everything it stores. The dimensions are architecture, data layer, interaction model, improvement loop, and team structure. Score each one independently. The total tells you where you stand. The gaps tell you where to invest.

Dimension 1: Architecture - Where Does AI Live in Your Stack?

This is the structural question. Not "do you use AI?" but "where is it?"

Level 1 : Bolt-On. AI is called via external API at specific endpoints. The core application logic is deterministic. You could replace every AI call with a hardcoded response and the product would function, just without the smart parts. Most enterprise SaaS tools with AI features sit here. The CRM that generates email drafts. The project management tool that auto-categorizes tickets. Useful additions to products that existed before the AI arrived.

Level 2 : Integrated. A shared AI gateway or service layer exists. Multiple features route through it. There is some prompt management, maybe a shared embedding store. But the core product logic does not depend on model inference. If the AI layer goes down, the product degrades but does not die. This is where most companies that claim to be AI Native actually land.

Level 3 : Structural. AI is a first-class runtime component. Model inference sits in the critical path of the product's core loop. Remove it and the product does not degrade. It stops. Cursor operates here. Agent Mode, Background Agents, BugBot, the Composer workflow. These are not features layered on top of an editor. The editor is a coordination layer for AI agents working on your codebase. Cursor 3.0 shipped with up to eight parallel background agents, subagent fan-out via /multitask, and automations that trigger AI responses to events without developer intervention. The editor is the interface. The AI is the product.

Dimension 2: Data Layer - How Is Your Data Designed to Be Consumed?

This dimension is the one most teams underestimate. Your data layer reveals your real architectural assumptions more honestly than your pitch deck does.

Level 1 : Traditional. Relational databases and document stores optimized for application queries. AI reads from the same tables the application does. There is no data infrastructure specifically designed for model consumption. When a team at this level wants to add AI features, they write extraction scripts that pull data out of Postgres and push it into a model's context window. It works. It does not scale.

Level 2 : Dual-Purpose. Vector stores and embedding pipelines exist alongside the relational data. Some retrieval-augmented generation is in place. But the primary data access patterns are still application-driven. The AI infrastructure feels like a parallel system, not the primary one. Many teams that built RAG pipelines in 2024 and 2025 land here. They have embeddings. They have retrieval. But the vector store is a sidecar, not the spine.

Level 3 : Intelligence-First. The data layer assumes AI will consume it. Embeddings are not an afterthought. They are the primary representation. Context windows, retrieval pipelines, and evaluation datasets are first-class data artifacts, maintained with the same rigor as production database schemas. Perplexity operates at this level. Its entire data pipeline exists to feed the conversational search experience. There is no underlying "list of links" database that the AI queries. The data is structured for intelligence from the point of ingestion. When Perplexity indexes a source, it is not storing a URL and a title. It is creating a retrievable, citable, contextually embeddable unit of knowledge.

Dimension 3: Interaction Model - How Do Users Interact With Intelligence?

The first article in this series introduced the command-based versus intent-based distinction. The scorecard makes it measurable.

Level 1 : Command + AI Assist. Users click, navigate, and fill forms. AI accelerates specific steps. Autocomplete, smart suggestions, draft generation. The user still drives. The AI co-pilots. Google Docs with Gemini sits here. You still open a document, position your cursor, and invoke the AI when you want help. The writing surface, the formatting tools, the collaboration model are all pre-AI constructs.

Level 2 : Hybrid. Some workflows are intent-based while others remain command-based. A product might let you describe a data analysis in plain language but still require you to manually configure the dashboard layout. Linear, the project management tool, is an interesting case at this boundary. You can describe what you want done in natural language, and the system will create issues and assign them. But the board structure, the workflow states, the team configuration are still manual command-based setup.

Level 3 : Intent-Native. The primary interaction is expressing intent. The system determines how to fulfill it. Users describe outcomes, not procedures. Claude Code is the cleanest example. There is no file tree to navigate. No editor pane to manage. You describe what you want the code to do. The agent writes code, runs tests, debugs failures, iterates across dozens of files, and presents the result. The entire development workflow reorganizes around expressing intent. Vercel's v0 takes a similar approach for frontend development. Describe the component you want. The system generates it, renders a live preview, and lets you iterate through conversation rather than through code.

Dimension 4: Improvement Loop - How Does the Product Get Smarter?

This is where the compounding advantage of AI Native architecture becomes visible. And where most self-assessments fall apart.

Level 1 : Ship to Improve. The product gets better when engineers ship features. AI model updates are manual, versioned, and infrequent. Someone on the team runs a fine-tuning job every quarter. Prompts are updated in code reviews. There is no automated evaluation of model quality, no systematic capture of user signals for improvement. This is the most common pattern, and it reveals a fundamental misunderstanding: treating AI components like static software instead of living systems.

Level 2 : Feedback-Informed. User signals are collected and inform model updates. Thumbs up and thumbs down on AI responses. Usage analytics on which suggestions get accepted. But the improvement still requires human-driven retraining cycles. The data flows in, gets analyzed, and eventually someone decides to update the prompts or retrain the model. The loop exists but it is not continuous.

Level 3 : Use to Improve. The product gets smarter when people use it. Evaluation loops, fine-tuning pipelines, and behavioral data create continuous learning without manual intervention. This is the level where the gap between AI Native and AI Augmented compounds over time. Cursor's codebase context system improves its suggestions the more you use it in a project. It reads your CURSOR.md file, your .cursorrules, your import patterns, your code style. The AI becomes more useful not because Anysphere shipped an update but because you used the product. The evaluation infrastructure at this level is not a nice-to-have. It is the core product mechanism. DeepEval, the open-source LLM evaluation framework, now supports over 50 research-backed metrics precisely because teams at Level 3 need automated quality measurement that catches drift before users do.
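One small piece of that measurement loop, sketched under assumptions: user accept/reject signals (the thumbs up, the accepted suggestion) feed a rolling acceptance rate that is compared against a baseline, so a drop is flagged automatically instead of waiting for someone to notice. AcceptanceMonitor and its parameters are hypothetical names. This is not how Cursor or DeepEval measure quality, and a real system would track richer metrics than a single rate.

```python
from collections import deque
from typing import Deque, Optional


class AcceptanceMonitor:
    """Tracks the share of AI suggestions users accept over a sliding window."""

    def __init__(self, baseline_rate: float, window: int = 100,
                 tolerance: float = 0.10) -> None:
        self.baseline_rate = baseline_rate   # acceptance rate measured at launch
        self.tolerance = tolerance           # allowed drop before flagging drift
        self.window = window
        self.recent: Deque[bool] = deque(maxlen=window)

    def record(self, accepted: bool) -> None:
        # Called once per suggestion shown to a user.
        self.recent.append(accepted)

    def rate(self) -> Optional[float]:
        if not self.recent:
            return None
        return sum(self.recent) / len(self.recent)

    def drifted(self) -> bool:
        # Only judge once the window is full, to avoid noise from a few samples.
        if len(self.recent) < self.window:
            return False
        return self.rate() < self.baseline_rate - self.tolerance
```

At Level 2 a person reads this number on a dashboard and decides what to do. At Level 3 the drifted() signal triggers the evaluation or retraining pipeline directly.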

Dimension 5: Team Structure - How Is AI Expertise Distributed?

Architecture follows org charts. Conway's Law has not been repealed by large language models.

Level 1 : Centralized AI Team. A dedicated ML or AI team that other teams submit requests to. AI is a service organization. Product teams describe what they want, the AI team builds it, and the result gets integrated. This creates a bottleneck that looks exactly like the "data science team" bottleneck of 2018. Every AI improvement queues behind every other AI improvement.

Level 2 : Embedded Specialists. AI engineers sit within product teams. Better than centralized, because the AI expertise is closer to the product context. But the rest of the pod still thinks in traditional software terms. The AI engineer is the only one who understands prompts, evals, and model selection. When that person goes on vacation, the AI features freeze.

Level 3 : AI-Literate Pods. Small cross-functional pods of three to five people where everyone has AI literacy. Evaluation, prompt design, and model selection are shared responsibilities, not specialist skills. Industry practice in 2026 has converged on this model. Optimum Partners documented it in their engineering management research. Harvard Business Review described the product strategist role as requiring "a blend of technical depth, product thinking, governance, and human-AI collaboration skills." The pod does not have an AI expert. The pod is AI-literate.

Scoring It

Add your scores across all five dimensions. The total maps to three zones.

5 to 7: AI Augmented. AI is a feature layer. Your product works without it. That is a legitimate architectural choice that serves many businesses well. But it is not AI Native, and the strategic implications are different. Your competitive moat is product execution, not intelligence compounding.

8 to 11: AI Integrated. You are in transition. Some dimensions are structurally AI-dependent, others are not. The risk at this level is staying here too long. Partial AI Nativeness creates technical debt in both directions: too committed to reverse, too incomplete to compound.

12 to 15: AI Native. AI is the infrastructure. The product, the data, the UX, and the team are built around intelligence as the core architectural assumption. Your competitive advantage compounds with every user interaction.

The score itself matters less than the distribution. A team that scores 3-3-3-1-1 has a clear action plan: fix the improvement loop and the team structure. A team that scores 2-2-2-2-2 across the board has a harder question: are you transitioning toward AI Native, or have you settled into a comfortable middle that will slowly lose ground?
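The scoring rules above are simple enough to write down as code. The sketch below uses hypothetical names (DIMENSIONS, zone, score) and encodes only what the scorecard states: five dimensions, levels 1 to 3, and the three zones. It returns the weakest dimensions alongside the total because the distribution is the part that drives action.

```python
from typing import Dict, List, Tuple

DIMENSIONS = (
    "architecture",
    "data_layer",
    "interaction_model",
    "improvement_loop",
    "team_structure",
)


def zone(total: int) -> str:
    # Zone boundaries as given in the scorecard.
    if 5 <= total <= 7:
        return "AI Augmented"
    if 8 <= total <= 11:
        return "AI Integrated"
    if 12 <= total <= 15:
        return "AI Native"
    raise ValueError(f"total {total} is outside the 5 to 15 range")


def score(levels: Dict[str, int]) -> Tuple[int, str, List[str]]:
    # Returns (total, zone, weakest dimensions).
    if set(levels) != set(DIMENSIONS):
        raise ValueError("score every dimension exactly once")
    if any(level not in (1, 2, 3) for level in levels.values()):
        raise ValueError("each level must be 1, 2, or 3")
    total = sum(levels.values())
    lowest = min(levels.values())
    weakest = [d for d in DIMENSIONS if levels[d] == lowest]
    return total, zone(total), weakest


# The two teams from the paragraph above.
lopsided = dict(zip(DIMENSIONS, (3, 3, 3, 1, 1)))
flat = dict(zip(DIMENSIONS, (2, 2, 2, 2, 2)))
```

Both example teams land in the same zone (totals of 11 and 10, both AI Integrated), but score(lopsided) names improvement_loop and team_structure as the weak spots while score(flat) returns all five dimensions. That is the difference between a clear action plan and a harder question.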

The Honest Conversation This Enables

The value of a scorecard is not the number. It is the conversation the number forces.

Most teams have never explicitly discussed which level they are at on each dimension. The CTO thinks the architecture is Level 3 because the AI is in the critical path. The VP of Engineering knows it is Level 2 because the data layer is still a sidecar. The product lead is frustrated because users interact with the AI through the same command-based interface the product had two years ago.

This misalignment is normal. It is also expensive. Teams investing in Level 3 features on top of Level 1 infrastructure will hit a wall. Teams hiring for Level 3 pod structures while the data layer requires Level 1 centralized specialists will burn through people. The dimensions are not independent. They constrain each other.

The companies that are pulling ahead right now are not the ones with the highest total score. They are the ones where every dimension is within one level of every other dimension. Balanced architecture compounds. Lopsided architecture creates friction that eventually stalls progress.

Run the scorecard with your leadership team this week. Score each dimension independently. Compare notes. The gaps between your individual scores, the places where the CTO sees a 3 and the engineering lead sees a 1, those gaps are where your real architectural debt lives.
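For teams that want to run this exercise in a spreadsheet or a script, the two checks reduce to simple arithmetic. The sketch below is one way to do it. The function names are hypothetical, and the choice of a two-level gap as "worth discussing" is an assumption, not part of the scorecard.

```python
from typing import Dict

DIMENSIONS = (
    "architecture",
    "data_layer",
    "interaction_model",
    "improvement_loop",
    "team_structure",
)


def is_balanced(levels: Dict[str, int]) -> bool:
    # Balanced means every dimension is within one level of every other.
    return max(levels.values()) - min(levels.values()) <= 1


def disagreements(ratings: Dict[str, Dict[str, int]], min_gap: int = 2) -> Dict[str, int]:
    # For each dimension, the spread between the highest and lowest rater.
    # Only dimensions where the spread reaches min_gap are returned.
    gaps = {}
    for dim in DIMENSIONS:
        scores = [levels[dim] for levels in ratings.values()]
        gap = max(scores) - min(scores)
        if gap >= min_gap:
            gaps[dim] = gap
    return gaps


# Illustrative ratings from two people scoring the same system independently.
ratings = {
    "cto": dict(zip(DIMENSIONS, (3, 3, 2, 2, 2))),
    "engineering_lead": dict(zip(DIMENSIONS, (3, 1, 2, 2, 1))),
}
```

Here disagreements(ratings) flags only the data layer, where the CTO sees a 3 and the engineering lead sees a 1. That dimension is the first one to put on the agenda.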

What Is AI Native? The One Test That Separates Real from Fake in 2026

Siddharth Bhalsod — Tue, 02 Jun 2026 05:45:31 +0000

Remove the AI from your product. What's left?

If the answer is a slower version of the same thing, you built an AI Augmented product. If the answer is nothing, a hollow shell, a product that cannot function at all, you built an AI Native one. That single question is the sharpest filter in tech right now, and most teams answering it honestly won't like what they find.

The term AI Native is everywhere in 2026. It's on pitch decks, investor memos, job descriptions, product landing pages. Every company that bolted a chatbot onto their existing interface now calls itself AI Native. Every SaaS tool with a "Generate with AI" button claims the label. The result is a phrase that has been stretched so thin it almost means nothing. Almost. Because the real thing still exists, and the gap between the real thing and the impersonators is widening fast.

The Wrong Definition Is Already Winning

Most people define AI Native as "a company that uses AI." By that definition, every company with a ChatGPT API call is AI Native. This is like calling every restaurant with a microwave a molecular gastronomy lab.

The better definition requires understanding what sits underneath the product. An AI Augmented product is a traditional product that added intelligence to existing workflows. The workflows existed before the AI. The data model existed before the AI. The user experience existed before the AI. Intelligence made things faster, but the skeleton is the same skeleton from 2019. A support tool that uses AI to suggest responses is AI Augmented. Remove the AI and agents still take calls, still resolve tickets, just slower.

An AI Native product is one where intelligence is the skeleton. The data model, the user experience, the architecture, the business logic all presume that AI is present. Remove the AI and nothing coherent remains. There is no "manual mode." There is no fallback workflow. The product simply ceases to exist as a product.

This is not a spectrum. It is a binary test.

Intelligence as Infrastructure, Not Feature

Cursor, the code editor built by Anysphere, is the clearest example of AI Native architecture in production today. It isn't VS Code with a smarter autocomplete bolted on. The editor was built from day one with the assumption that an LLM would be a first-class participant in every coding action. Agent Mode, which handles autonomous multi-file editing, is not a plugin. It is the product. Background Agents run parallel tasks while you work on something else. BugBot reviews pull requests without waiting for a human. Cursor reached $2 billion in annualized revenue by mid-2026 because it did not ask users to adopt AI inside their existing tool. It asked them to adopt a new tool where AI is the tool.

Compare this to GitHub Copilot. Copilot adds AI to an existing editor through a plugin architecture. The editor, VS Code, was designed and shipped before Copilot existed. Copilot makes the editor faster. Remove Copilot and you still have VS Code, fully functional, just without the suggestions. That is AI Augmented. Not worse by definition, but architecturally different in ways that compound over time.

The same pattern plays out in search. Perplexity rebuilt the search experience from scratch around an LLM, treating the model as the interface, not as a helper behind a traditional search box. There is no list of ten blue links with an AI summary pinned to the top. The entire experience is a conversation with citations. Remove the AI and Perplexity is an empty screen. Google, by contrast, added AI Overviews to a search results page that has existed for more than 25 years. Google Search without AI Overviews is still Google Search. That distinction explains why Perplexity crossed $450 million in annual recurring revenue in early 2026, growing from a standing start against the most dominant product in internet history.

The Architecture Test Goes Deeper Than the Interface

The "remove the AI" test is useful as a first filter, but the real differences between AI Native and AI Augmented live in the architecture underneath.

In AI Augmented systems, the data pipeline was designed for deterministic software. Data gets structured into rows, columns, relational tables. AI is called as a service at specific points. The result gets injected back into the deterministic flow. This works, but it creates a ceiling. Every time the AI needs context, it has to reach across an abstraction boundary to fetch it. Every time you want to improve the AI's behavior, you are constrained by a data model that was not designed for that purpose.

In AI Native systems, the data layer assumes intelligence will consume it. Context windows, embedding stores, retrieval pipelines, evaluation loops. These are first-class architectural components, not afterthoughts. The system gets smarter as it runs because the architecture was designed to learn, not just to execute. Abnormal Security, which provides AI Native email protection, built its detection system around behavioral models from the start. The AI does not sit on top of a rules engine. The AI is the engine. Static rules, predefined policies, manual intervention, these are gone. Signals get evaluated by models trained on organizational behavior, and the system gets more accurate with every email it processes.

This architectural difference creates compounding advantages. An AI Augmented product improves when engineers ship new features. An AI Native product improves when users use it.

Command-Based vs. Intent-Based: The UX Divide

The clearest way for a non-technical person to feel the AI Native difference is in how the product expects you to interact with it.

AI Augmented products are command-based. You click a button. You fill a form. You navigate a menu. AI accelerates what happens after you give the command, but you still give the command. Zendesk with AI features still runs on a ticketing queue. Agents still manage workflows. The AI suggests responses and categorizes issues, but the interaction model is the same one support teams have used for a decade.

AI Native products are intent-based. You describe what you want. The system figures out how to do it. Claude Code, Anthropic's terminal-based coding agent, does not present an editor interface. You describe the change you want in plain language. The agent writes code, runs tests, debugs failures, and iterates, sometimes resolving issues across dozens of files without you ever opening one. The entire development workflow reorganizes around expressing intent rather than issuing commands.

This shift matters because it changes who can use the product and what they can accomplish. Command-based interfaces require the user to know how the system works. Intent-based interfaces require the user to know what they want. That is a different skill entirely, and it opens the product to people who were previously locked out by complexity.

The Honest Self-Assessment Most Teams Fail

Y Combinator published their requests for startups in 2026 and highlighted a pattern worth paying attention to: the strongest AI Native companies they see have made their entire company queryable. Not just the product. The company. Knowledge, decisions, customer data, operational context, all of it accessible through natural language by anyone on the team.

Most companies are not there. Most companies are not close. A McKinsey survey of 1,400 technology companies found that AI Native products generated 2.6 times faster revenue growth than AI Augmented alternatives in the same market categories. The gap is not theoretical. It shows up in revenue, in customer retention, in how fast a team can move from idea to deployed product.

The honest version of the self-assessment looks like this. If your AI breaks, does the product still work? If yes, you are AI Augmented. That is a legitimate architectural choice and it serves many businesses well. But it is not AI Native, and calling it AI Native will lead you to make the wrong investments, hire the wrong team, and build the wrong roadmap.

If you are competing against someone who is actually AI Native and you are AI Augmented, you are not at a talent disadvantage or a tooling disadvantage. You are at a structural one. They are not doing the same things faster. They are doing different things entirely.

The architecture you choose in the next twelve months is difficult to reverse. Products built around deterministic workflows do not easily transform into products built around intelligence. The data model is wrong. The abstraction layers are wrong. The user expectations are wrong. It is not a refactor. It is a rebuild.

The companies that will matter in three years are making that architectural decision right now, and the first step is being honest about which side of the line they are actually on.