Most teams get the model right on the first try.
They pick Claude or GPT-4o, wire up a few prompts, and ship something that impresses in a demo. Then they spend the next six months wondering why responses drift, costs compound faster than users do, and the system that felt clever in week two feels brittle by month four.
The model was not the problem. The model was never the problem.
The mistake is architectural, and it almost always starts the same way: teams choose the model before they design the data layer. Everything downstream from that sequencing error will cost them.
The Wrong Starting Point
Here is how most teams actually build: they pick a foundation model, write prompts that work for their test cases, and then figure out how to feed the model the right information. The data layer is treated as a support function for the model. A retrieval step to bolt on. Something to sort out later.
This is backwards.
In AI Native systems, the data layer is not a supporting actor. It is the foundation that determines whether the model can do anything useful at all. A well-prompted model operating on stale, poorly structured, or imprecisely retrieved data will underperform a weaker model operating on clean, fresh, semantically precise context. Every time.
AI Native infrastructure consists of five distinct layers, each with a specific job and each dependent on the one below it. Start from the bottom, not the top.
Layer One: The Embedding Store
Before any user query fires, before any retrieval logic runs, data has to be prepared. Raw documents, knowledge bases, product catalogs, customer history — whatever domain knowledge the system needs — must be converted into vector embeddings and stored in a way that allows fast, semantically relevant retrieval.
This is the embedding store, and the choices made here reverberate through the entire system.
The first real decision is managed versus self-hosted. Pinecone is the category-default managed option: operationally simple, scales without tuning, and handles multi-region distribution natively. For teams that want full control over their infrastructure without a managed service dependency, Qdrant, built in Rust, is designed for low retrieval latency and handles complex metadata filtering cleanly. Weaviate sits in between: open-source, self-hostable, with native hybrid search that combines semantic and keyword retrieval without external tooling.
For teams already running Postgres, pgvector is worth a serious look before adding a dedicated vector database. Production-grade since the 0.7 release, it handles up to roughly 50 million vectors on a well-provisioned instance, though the practical ceiling depends on embedding dimensionality, index type, and hardware. The operational savings of not running a separate system are real, and at that scale the retrieval quality is comparable to purpose-built options.
The second decision, less discussed and more consequential, is chunking strategy. How documents are split before embedding determines what the model can actually retrieve. Fixed-size chunks with no attention to semantic boundaries produce retrieval that regularly cuts a sentence in half, drops the precise clause that answers the query, or returns paragraphs that contain the right word in the wrong context. Semantic chunking — splitting on paragraph breaks, section boundaries, or structural signals within the document — consistently outperforms fixed-size approaches. It adds complexity upfront. It is worth it.
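As a minimal sketch of the paragraph-boundary approach described above, the function below splits on blank lines and packs whole paragraphs into chunks up to a size limit. The function name, the `max_chars` parameter, and the sample text are illustrative assumptions; a production chunker would likely also use headings, token counts, and overlap.

```python
import re

def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Split on blank lines, then pack whole paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An oversized paragraph becomes its own chunk instead of being cut mid-sentence.
            current = para
    if current:
        chunks.append(current)
    return chunks

# Hypothetical sample document with three paragraphs.
doc = (
    "Refunds are issued within 14 days.\n\n"
    "To cancel, open Settings and choose Close account.\n\n"
    "Contact support for billing disputes."
)
chunks = chunk_by_paragraph(doc, max_chars=100)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

No chunk boundary falls inside a sentence: the first two paragraphs fit together under the limit, and the third starts a new chunk.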
A third decision compounds: whether to use dense retrieval only (pure vector similarity) or hybrid retrieval that combines vector search with keyword matching. For domain-specific vocabularies — product codes, technical terms, proper nouns — pure semantic search regularly misses exact-match queries. Qdrant and Weaviate both offer hybrid retrieval that fuses dense and sparse scores without external tooling. For most production systems serving real users on real content, this is the right default.
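To make the idea of fusing dense and keyword results concrete, here is a small sketch of reciprocal rank fusion, one widely used way to combine two ranked lists. It is offered as an illustration, not as a description of what any particular database does internally; the document IDs and the constant `k = 60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query containing a product code.
dense_hits = ["faq_returns", "sku_4471_spec", "faq_shipping"]  # vector similarity order
keyword_hits = ["sku_4471_spec", "sku_4471_manual"]            # exact-match order

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

The document that both retrievers agree on rises to the top, and the exact-match hit that dense search missed entirely still makes the candidate list.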
Layer Two: The Retrieval Pipeline
The embedding store holds the vectors. The retrieval pipeline is the logic that decides which ones to surface, in what order, and in what form.
Most teams treat retrieval as a single step. Query comes in, nearest neighbors come back, those chunks go into the prompt. This works well enough in demos. In production, with real query distributions and real document variance, it degrades predictably.
Production retrieval pipelines have three stages:
Query transformation happens before the vector search. The user's literal input is rarely the best query to run against the embedding store. A user asking "how do I cancel?" might be best served by retrieving chunks about cancellation policy, refund terms, and account deletion procedures simultaneously. Rewriting the query, expanding it into multiple sub-queries, or using the conversation history to disambiguate intent before retrieval is the difference between a system that retrieves what the user typed and one that retrieves what the user meant.
Retrieval and re-ranking is the search step itself, followed by a second pass that re-scores the top candidates for relevance before passing them to the model. Bi-encoder models (the ones that power standard vector search) optimize for broad recall. Cross-encoder re-rankers optimize for precision among the top results. Running both, retrieving broadly with the bi-encoder and then re-ranking the top 20 results with a cross-encoder before selecting the final context, produces meaningfully better retrieval quality than either approach alone. The added latency depends on the re-ranker model and hardware, but for a candidate set this small it is usually under 50 milliseconds.
Context assembly is the final step before the prompt. Which chunks to include, in what order, how to handle redundancy across chunks, whether to add metadata like document date or source type: these decisions shape what the model sees. Models tend to perform better when the most relevant context appears near the beginning or end of the context window, not buried in the middle. Position matters more than engineers typically expect.
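The three stages above can be sketched end to end. Everything here is a toy stand-in under stated assumptions: `toy_expand` hard-codes what a query-rewriting model would produce, `toy_search` uses word overlap in place of a vector search, and `toy_rerank_score` stands in for a cross-encoder. Only the shape of the pipeline is the point.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Chunk:
    id: str
    text: str

# Hypothetical three-document corpus.
CORPUS = [
    Chunk("cancel", "To cancel, open Settings and choose Close account."),
    Chunk("refund", "Refunds are issued within 14 days of cancellation."),
    Chunk("shipping", "Orders ship within two business days."),
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_expand(query: str) -> list[str]:
    # Stand-in for a model-based rewrite: the original query plus related sub-queries.
    return [query, "refunds policy", "account deletion"]

def toy_search(query: str, k: int) -> list[Chunk]:
    # Stand-in for vector search: any chunk sharing at least one word with the query.
    return [c for c in CORPUS if words(query) & words(c.text)][:k]

def toy_rerank_score(query: str, text: str) -> float:
    # Stand-in for a cross-encoder: count of shared words.
    return float(len(words(query) & words(text)))

def run_pipeline(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str, int], list[Chunk]],
    rerank_score: Callable[[str, str], float],
    retrieve_k: int = 20,
    final_k: int = 3,
) -> str:
    # Stage 1: query transformation.
    sub_queries = expand(query)
    # Stage 2a: broad retrieval across all sub-queries, de-duplicated by chunk ID.
    candidates: dict[str, Chunk] = {}
    for sub_query in sub_queries:
        for chunk in search(sub_query, retrieve_k):
            candidates.setdefault(chunk.id, chunk)
    # Stage 2b: re-rank candidates against the original query.
    ranked = sorted(candidates.values(), key=lambda c: rerank_score(query, c.text), reverse=True)
    # Stage 3: context assembly, most relevant first, with a source label on each chunk.
    return "\n\n".join(f"[{c.id}] {c.text}" for c in ranked[:final_k])

context = run_pipeline("how do I cancel?", toy_expand, toy_search, toy_rerank_score)
print(context)
```

With the literal query alone, the toy search would return only the cancellation chunk. The expanded sub-queries also pull in the refund chunk, and the unrelated shipping chunk never reaches the prompt.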
Layer Three: Context Management
This is where most teams discover that they had an implicit assumption they never examined.
They assumed context would stay small enough to not matter.
Context management is the layer that tracks what the model needs to know within a session, across sessions, and at the system level — and makes deliberate choices about what to include, what to compress, and what to discard. It sounds simple. In practice, it is the layer that silently determines whether the system feels coherent or amnesiac, expensive or cost-efficient.
The clearest failure mode is context stuffing: including everything the system might need, in full, on every request, because it is easier than deciding what to exclude. At low traffic volumes this is fine. At scale, the token cost compounds fast, latency climbs as the context window fills, and the model's attention degrades on long-context inputs. An enterprise application routing ten thousand requests per hour through a 128K context window, when 60K of that context is the same static background information repeated verbatim on every call, is not a data architecture problem — it is an engineering decision that has simply not been made yet.
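The arithmetic behind that example is worth doing once. The request rate and static token count below come from the scenario above; the price per million input tokens is a hypothetical placeholder, so substitute your provider's actual rate.

```python
def repeated_context_cost(
    requests_per_hour: int,
    static_tokens_per_request: int,
    usd_per_million_input_tokens: float,
) -> tuple[int, float]:
    """Return (repeated tokens per hour, hourly USD cost of re-sending them)."""
    tokens = requests_per_hour * static_tokens_per_request
    cost = tokens / 1_000_000 * usd_per_million_input_tokens
    return tokens, cost

tokens_per_hour, usd_per_hour = repeated_context_cost(
    requests_per_hour=10_000,
    static_tokens_per_request=60_000,
    usd_per_million_input_tokens=3.00,  # hypothetical price, not a quoted rate
)
print(f"{tokens_per_hour:,} repeated tokens/hour, about ${usd_per_hour:,.2f}/hour")
```

At the assumed price, that is 600 million repeated tokens and $1,800 per hour spent re-sending the same background text, before counting any of the context that actually varies between requests.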
Effective context management has three components. A session layer tracks the immediate conversation and recent user actions, kept compact, summarized aggressively after the first few turns rather than appended indefinitely. A memory layer handles what the system should retain across sessions — user preferences, prior decisions, domain-specific facts about this user's context — stored as structured records, not as raw conversation history. And a system layer manages the baseline context that every request needs: the product's core knowledge, current configuration, and any real-time state the model should be aware of.
The goal is not minimalism for its own sake. It is precision. The right context, fresh, in the right position, without padding.
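One way to picture the session, memory, and system layers described above is as three separate structures that are only combined when a request is rendered. This is a rough sketch: the class names, the `max_recent` setting, and the truncation-based summarizer are assumptions, and a real system would call a model to summarize older turns.

```python
from dataclasses import dataclass, field

@dataclass
class SessionContext:
    """Session layer: a rolling summary plus only the most recent turns."""
    summary: str = ""
    recent_turns: list[str] = field(default_factory=list)
    max_recent: int = 4

    def add_turn(self, turn: str) -> None:
        self.recent_turns.append(turn)
        while len(self.recent_turns) > self.max_recent:
            oldest = self.recent_turns.pop(0)
            # Placeholder summarizer: truncation stands in for a model-written summary.
            self.summary = f"{self.summary} {oldest[:40]}".strip()

@dataclass
class RequestContext:
    system: str             # system layer: baseline knowledge and current configuration
    memory: dict[str, str]  # memory layer: structured records kept across sessions
    session: SessionContext

    def render(self) -> str:
        memory_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(self.memory.items()))
        parts = [self.system, "Known about this user:\n" + memory_lines]
        if self.session.summary:
            parts.append("Earlier in this session: " + self.session.summary)
        parts.append("Recent turns:\n" + "\n".join(self.session.recent_turns))
        return "\n\n".join(parts)

session = SessionContext(max_recent=2)
for turn in ["user: hi", "assistant: hello", "user: cancel my plan"]:
    session.add_turn(turn)

request = RequestContext(
    system="You are a support assistant.",
    memory={"plan": "pro"},
    session=session,
)
print(request.render())
```

Keeping the layers separate means each can have its own freshness and size policy, and the session layer stays bounded no matter how long the conversation runs.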
Layer Four: The Eval Framework
Everything built so far produces outputs that cannot be tested with a passing or failing unit test. The model might return a factually correct response in the wrong format. It might answer the literal question while missing the user's actual intent. It might perform well on the examples in your test suite and drift on the long tail of real queries that you have not seen yet.
Eval infrastructure is what makes AI Native systems improvable, rather than just deployed.
The production pattern that engineering teams are converging on in 2026 uses two tools with a clear division of labor. A lightweight open-source framework handles CI/CD gating at the PR level: DeepEval is the closest thing the LLM eval world has to pytest, running assertion-style tests against model outputs on every code change. RAGAS handles retrieval-specific metrics — context precision, answer faithfulness, answer relevance — for RAG-heavy systems. These run in the pipeline, automatically, before any change ships.
A second tier handles production monitoring and regression tracking: Braintrust for dataset-first prompt regression workflows with human annotation, or Arize Phoenix for teams that need production observability alongside evaluation. The two tiers run together. Unit-level evals catch regressions before deployment. Production evals catch drift after it.
The discipline that separates teams who use evals from teams who have eval infrastructure is this: the metrics are defined before the system is built, not after. What does "correct" mean for this use case? What does "faithful" mean? What does "hallucinated" mean, specifically, for this domain? These are design questions, not measurement questions. Teams that get this right start their architecture work at the eval layer. Teams that get this wrong discover they cannot measure progress at the point when it matters most.
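Defining the metric first can be as small as this sketch: a labeled set of queries, a precision-at-k function, and a threshold that gates a change. This is a deliberately simplified metric written in plain Python, not the DeepEval or RAGAS API, and the cases, the fake retriever, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    query: str
    relevant_ids: frozenset[str]  # chunk IDs a human labeled as relevant

def precision_at_k(retrieved_ids: list[str], relevant_ids: frozenset[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def mean_precision(cases: list[EvalCase], retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    scores = [precision_at_k(retrieve(case.query), case.relevant_ids, k) for case in cases]
    return sum(scores) / len(scores)

def passes_gate(cases: list[EvalCase], retrieve: Callable[[str], list[str]], threshold: float) -> bool:
    return mean_precision(cases, retrieve) >= threshold

# Hypothetical labeled cases and a fake retriever standing in for the real pipeline.
CASES = [
    EvalCase("how do I cancel?", frozenset({"cancel", "refund"})),
    EvalCase("when do I get my money back?", frozenset({"refund"})),
]
FAKE_RESULTS = {
    "how do I cancel?": ["cancel", "refund", "shipping"],
    "when do I get my money back?": ["refund", "shipping", "cancel"],
}

def fake_retrieve(query: str) -> list[str]:
    return FAKE_RESULTS[query]

print(round(mean_precision(CASES, fake_retrieve), 2))
print(passes_gate(CASES, fake_retrieve, threshold=0.6))
```

The labeled cases and the threshold exist before any retrieval code does, so the first working pipeline already has a baseline to beat.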
Layer Five: The Gateway
The LLM gateway is the layer that most teams add last. It should be among the first decisions made.
A gateway sits between your application and every model provider. It handles routing, cost controls, caching, failover, and observability — functions that are not optional at any meaningful production scale, but that most teams implement as ad hoc logic scattered across application code until a provider outage or a cost spike forces the issue.
At scale, the case is not theoretical. Teams running production AI workloads that skip this layer can see token spend grow by an estimated 30 to 40 percent more than necessary, driven by redundant identical or near-identical requests that a semantic cache would have served without an inference call. They carry outsized operational risk during provider outages that proper failover configuration would absorb. They cannot attribute costs to teams or features because there is no central point of control.
Bifrost, an open-source gateway built in Go, is reported by its maintainers to handle 5,000 requests per second at 11 microseconds of overhead, low enough that it adds no perceptible latency to the inference call. LiteLLM is the most widely deployed open-source option for teams that want a Python-native solution with broad provider coverage. Cloudflare AI Gateway is the lowest-friction managed option for teams that want zero infrastructure maintenance. Kong AI Gateway integrates into existing API management infrastructure for enterprise environments already running Kong.
The right choice between them matters less than the decision to have one. Without a gateway, every team inevitably rebuilds fragments of it at the application layer: manual retry logic, cost tracking spreadsheets, per-feature model selection buried in function calls. The gateway consolidates that logic into a single, auditable layer. When a provider goes down at 2am, the failover runs automatically. When a new model releases and you want to test it on five percent of traffic, you change one configuration line.
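To show what consolidating that logic looks like, here is a toy in-process gateway with caching, ordered failover, and percentage-based canary routing. It is a sketch under loud assumptions: the `Gateway` class and provider functions are hypothetical, the cache is exact-match (a semantic cache would compare embeddings instead), and real gateways such as those named above run as separate services with their own configuration formats.

```python
import hashlib
from typing import Callable, Optional

Provider = Callable[[str], str]  # hypothetical signature: prompt in, completion out

class Gateway:
    def __init__(
        self,
        providers: list[tuple[str, Provider]],
        canary: Optional[tuple[str, Provider]] = None,
        canary_percent: int = 0,
    ) -> None:
        self.providers = providers          # tried in order: primary first, then fallbacks
        self.canary = canary                # optional new model to trial
        self.canary_percent = canary_percent
        self.cache: dict[str, str] = {}     # exact-match cache keyed by prompt
        self.calls: dict[str, int] = {}     # per-provider call counts for cost attribution

    def _bucket(self, request_id: str) -> int:
        # Stable 0-99 bucket so a given request ID always routes the same way.
        digest = hashlib.sha256(request_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def complete(self, request_id: str, prompt: str) -> str:
        if prompt in self.cache:
            return self.cache[prompt]
        chain = list(self.providers)
        if self.canary and self._bucket(request_id) < self.canary_percent:
            chain.insert(0, self.canary)
        last_error: Optional[Exception] = None
        for name, call in chain:
            try:
                result = call(prompt)
            except Exception as exc:  # any provider error triggers failover
                last_error = exc
                continue
            self.calls[name] = self.calls.get(name, 0) + 1
            self.cache[prompt] = result
            return result
        raise RuntimeError("all providers failed") from last_error

def failing_primary(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")

def backup(prompt: str) -> str:
    return f"backup:{prompt}"

gateway = Gateway([("primary", failing_primary), ("backup", backup)])
first = gateway.complete("req-1", "hello")   # primary fails, backup answers
second = gateway.complete("req-2", "hello")  # served from cache, no provider call
print(first, second, gateway.calls)
```

Changing `canary_percent` is the single-line configuration change described above: the application code calling `complete` never needs to know which provider answered.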
The Right Build Order
The mistake is not in the individual layer choices. Most teams are thoughtful about which embedding store they pick, which eval framework they try. The mistake is in the order.
Teams that start with the model end up retrofitting the infrastructure around a system that was already making assumptions about what the data layer would eventually provide. The embedding store gets added to support a retrieval pattern that the prompt design has already locked in. The eval framework gets added when the system is already live and there is no baseline to regress against. The gateway gets added when the first cost spike arrives.
Teams that start with the data layer make different decisions. They define what "good retrieval" means before they write a prompt. They choose their embedding store based on the query patterns their system will actually need to support. They design the context management strategy before they know how often it will need to run.
The model sits at the top of this stack, not the bottom. It is the most visible layer. It is the layer that produces the output the user sees. But it is the last thing to configure, not the first.
Starting with the model is like designing a building by choosing the facade material before you know the load-bearing structure. The facade is what people will look at. The structure is what holds it up.
Build the structure first.




