<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Riddhesh</title>
    <description>The latest articles on DEV Community by Riddhesh (@riddhesh).</description>
    <link>https://dev.to/riddhesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1251181%2F08edd1cf-34f0-4e1c-b2df-130a19bff5e1.webp</url>
      <title>DEV Community: Riddhesh</title>
      <link>https://dev.to/riddhesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/riddhesh"/>
    <language>en</language>
    <item>
      <title>Should You Be Building on MCP in 2026?</title>
      <dc:creator>Riddhesh</dc:creator>
      <pubDate>Wed, 29 Apr 2026 10:06:06 +0000</pubDate>
      <link>https://dev.to/riddhesh/should-you-be-building-on-mcp-in-2026-47lc</link>
      <guid>https://dev.to/riddhesh/should-you-be-building-on-mcp-in-2026-47lc</guid>
      <description>&lt;h2&gt;
  
  
  What Is MCP (Model Context Protocol) and Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;Before 2024, connecting an AI agent to external tools meant writing custom integration code for every single combination. A GitHub connector for Claude, a different one for GPT-4, another for your internal LLM. Each model, each tool, and each team had its own bespoke glue layer.&lt;/p&gt;

&lt;p&gt;That is the N×M problem. N AI models multiplied by M tools equals N×M custom integrations to build, maintain, and debug. For a team running three models across ten internal tools, that is thirty separate connectors. Every new model you add multiplies the work.&lt;/p&gt;

&lt;p&gt;Model context protocol collapses that to N+M. Each AI model implements the MCP client protocol once. Each tool or data source exposes an MCP server once. Any client can talk to any server without additional integration code.&lt;/p&gt;
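&lt;p&gt;The scaling difference is worth making concrete. A minimal sketch, using the hypothetical team from above (three models, ten tools):&lt;/p&gt;

```python
def connector_count(models: int, tools: int, use_mcp: bool) -> int:
    """Number of integrations to build and maintain.

    Point-to-point wiring needs one connector per model/tool pair (N*M);
    MCP needs one client implementation per model plus one server per
    tool (N+M).
    """
    return models + tools if use_mcp else models * tools

print(connector_count(3, 10, use_mcp=False))  # 30 bespoke connectors
print(connector_count(3, 10, use_mcp=True))   # 13 protocol implementations
```

Adding a fourth model costs ten new connectors in the first scheme and exactly one client implementation in the second.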

&lt;p&gt;The USB-C analogy that circulates in every MCP explainer is accurate for once. One protocol, one port, everything connects. But here is the part most explainers leave out: MCP does not replace REST APIs. It sits above them. Your existing APIs still do the actual work. &lt;/p&gt;

&lt;p&gt;MCP is the orchestration layer that makes those APIs intelligible to an AI agent at runtime, not at build time.&lt;br&gt;
That distinction matters when you are making architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MCP Took Over in 18 Months: The Adoption Timeline
&lt;/h2&gt;

&lt;p&gt;The adoption curve for MCP is unlike most open standards. It did not take years to reach critical mass. It took one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Significance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic launches MCP&lt;/td&gt;
&lt;td&gt;November 2024&lt;/td&gt;
&lt;td&gt;Open standard released with TypeScript and Python SDKs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI formally adopts&lt;/td&gt;
&lt;td&gt;March 2025&lt;/td&gt;
&lt;td&gt;Cross-vendor standardization confirmed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft Copilot Studio integration&lt;/td&gt;
&lt;td&gt;July 2025&lt;/td&gt;
&lt;td&gt;Enterprise channel opens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS adds support&lt;/td&gt;
&lt;td&gt;November 2025&lt;/td&gt;
&lt;td&gt;Cloud infrastructure layer adoption complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Donated to Linux Foundation (AAIF)&lt;/td&gt;
&lt;td&gt;December 2025&lt;/td&gt;
&lt;td&gt;Single-vendor dependency permanently removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;97M monthly SDK downloads, 10K+ active public servers&lt;/td&gt;
&lt;td&gt;March 2026&lt;/td&gt;
&lt;td&gt;De facto infrastructure status&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last milestone is the one that changes the build decision. This is not a framework winning a competition. It is a standard that has already won. When OpenAI, Google, Microsoft, and AWS all implement the same protocol and donate governance to a neutral foundation, the question of whether MCP becomes the standard is closed. It already is.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025" rel="noopener noreferrer"&gt;enterprise AI adoption data&lt;/a&gt;, 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5% in 2025. Every one of those agents needs to talk to tools. MCP is how they do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP Actually Solves for Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The N×M Integration Problem
&lt;/h3&gt;

&lt;p&gt;Industry reports indicate that connecting a legacy service via traditional REST APIs consumed three to five days of senior developer time per integration, before factoring in ongoing maintenance.&lt;/p&gt;

&lt;p&gt;Adopting MCP can reduce initial integration development time by up to 30% and lower ongoing maintenance costs by up to 25%, simply by eliminating the need to write custom connector code for every new AI platform.&lt;/p&gt;

&lt;p&gt;We have felt this directly. Before MCP in our stack, adding a new data source to an agent meant a full day of wiring: auth handling, response normalization, error mapping, and testing the edge cases. With MCP, a server that already exists connects in minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Tool Discovery
&lt;/h3&gt;

&lt;p&gt;This is the architectural shift most developers underestimate until they have built a real agentic system.&lt;/p&gt;

&lt;p&gt;Traditional REST assumes the client knows exactly which endpoint to call. A developer reads the docs, hardcodes the request, and ships it. That works fine for software calling software.&lt;/p&gt;

&lt;p&gt;AI agents operate differently. They need to ask at runtime, "What can I do here?" MCP servers answer that question through a tools/list call. The agent discovers available capabilities, reasons about which to use, and chains them across multiple steps, all without any of that logic being hardcoded at build time.&lt;/p&gt;
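&lt;p&gt;The shape of that exchange is plain JSON-RPC 2.0. A minimal sketch, with a canned response standing in for a live server (the tool entries are illustrative, not from any real MCP server):&lt;/p&gt;

```python
import json

# JSON-RPC 2.0 request an MCP client sends to discover capabilities.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Canned response standing in for a live MCP server.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {"name": "list_open_prs",
             "description": "List open pull requests",
             "inputSchema": {"type": "object",
                             "properties": {"repo": {"type": "string"}}}},
            {"name": "close_pr",
             "description": "Close a pull request",
             "inputSchema": {"type": "object",
                             "properties": {"number": {"type": "integer"}}}},
        ]
    },
}

# The agent reasons over this list at runtime instead of calling
# endpoints hardcoded at build time.
available = [t["name"] for t in response["result"]["tools"]]
print(json.dumps(request))
print(available)
```

The inputSchema on each tool is what lets the model construct valid arguments without ever reading your API docs.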

&lt;h3&gt;
  
  
  Stateful Sessions vs Stateless APIs
&lt;/h3&gt;

&lt;p&gt;REST is stateless by design. Every request carries its full context. For human-written software, this is a feature: it makes APIs predictable and easy to cache.&lt;/p&gt;

&lt;p&gt;For multi-step agent tasks, being stateless is a liability. When an agent is working through a sequence (list open PRs, summarize each, flag the stale ones, close them), the context from step one must still be available in step four.&lt;/p&gt;

&lt;p&gt;MCP sessions maintain that state. No re-authentication, no re-sending context, no re-establishing what the agent already knows.&lt;/p&gt;
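&lt;p&gt;A minimal sketch of what that buys you. The class shape is hypothetical; real MCP clients manage this through the protocol's session lifecycle rather than a plain dict, but the property is the same: step-one results survive to step four.&lt;/p&gt;

```python
class AgentSession:
    """Illustrative session object: state persists across tool calls."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.context: dict = {}  # survives across steps, unlike stateless REST

    def record(self, step: str, result):
        self.context[step] = result

# PR-triage sequence: each step sees everything recorded before it.
session = AgentSession("pr-triage-bot")
session.record("open_prs", [101, 102, 103])   # step one
session.record("stale", [102])                # step three
print(session.context["open_prs"])            # still available at step four
```

With stateless REST, that open_prs list would have to be re-fetched or re-sent as context on every subsequent request.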

&lt;h2&gt;
  
  
  MCP vs REST API vs Function Calling: Which One Do You Actually Need?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp93naxp1u65bkmnkvig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp93naxp1u65bkmnkvig.png" alt="Hidden costs of MCP in production diagram showing security risks, token overhead, debugging complexity, and infrastructure impact on budget" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
This is the comparison most teams need before making an architecture decision, and it is the one nobody writes cleanly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Designed For&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Tool Discovery&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;REST API&lt;/td&gt;
&lt;td&gt;Developer-written software&lt;/td&gt;
&lt;td&gt;Stateless&lt;/td&gt;
&lt;td&gt;Manual, hardcoded&lt;/td&gt;
&lt;td&gt;Direct integrations, high-volume deterministic operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function Calling (native)&lt;/td&gt;
&lt;td&gt;Single-model tool use&lt;/td&gt;
&lt;td&gt;Per-call context&lt;/td&gt;
&lt;td&gt;Defined in system prompt&lt;/td&gt;
&lt;td&gt;Simple tool use within one model, known toolset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP&lt;/td&gt;
&lt;td&gt;AI agents, multi-model systems&lt;/td&gt;
&lt;td&gt;Stateful sessions&lt;/td&gt;
&lt;td&gt;Dynamic, runtime&lt;/td&gt;
&lt;td&gt;Multi-tool agentic workflows, cross-platform agent systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When REST still wins
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://code-b.dev/blog/setting-up-payment-apis" rel="noopener noreferrer"&gt;Payment processing&lt;/a&gt;, high-frequency deterministic operations, any integration where a human writes the calling code. REST is also significantly more mature on security OAuth 2.0, JWT, mTLS, and API gateway patterns have been battle-tested for over a decade.&lt;/p&gt;

&lt;h3&gt;
  
  
  When function calling is enough
&lt;/h3&gt;

&lt;p&gt;If you are using one model and your toolset is small and stable, native function calling is simpler. There is no server to run, no session to manage, and debugging is straightforward. Do not introduce MCP complexity for a use case that two tool definitions in a system prompt can handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  When MCP earns its complexity
&lt;/h3&gt;

&lt;p&gt;More than three to five integrations, a need for dynamic tool discovery, cross-platform agent deployments where the same tools need to work with multiple AI models, or enterprise environments where centralized governance of agent actions is a requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problems With MCP in Production Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mj31zk1tld12z3d4d7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mj31zk1tld12z3d4d7h.png" alt="MCP vs REST API vs Function Calling comparison diagram showing differences in state management, tool discovery, and use cases for AI systems" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
MCP works great in demos. The production reality is messier, and most content about MCP right now skips the parts that will actually cost you time and money.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Is Genuinely Immature
&lt;/h3&gt;

&lt;p&gt;This is the section most MCP posts omit. It should not be omitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pomerium.com/blog/mcp-server-security-risks-what-development-teams-need-to-know-in-2026" rel="noopener noreferrer"&gt;Security research&lt;/a&gt; covering 2,614 MCP implementations found that 82% had file operation vulnerabilities to path traversal attacks. Two-thirds had some form of code injection risk. Over a third were susceptible to command injection. These are not theoretical every category has confirmed CVEs with public exploits.&lt;/p&gt;

&lt;p&gt;The first two months of 2026 alone saw 30+ MCP-specific CVE filings from researchers at Check Point, Invariant Labs, and Adversa AI.&lt;br&gt;
The root cause is architectural. &lt;/p&gt;

&lt;p&gt;The MCP specification does not include built-in authentication or authorization. Every server you deploy inherits whatever permissions it has been granted. Every agent request flows through without verification unless you add controls externally, and most teams do not add them correctly or at all.&lt;/p&gt;

&lt;p&gt;Tool poisoning is a specific risk worth calling out. Malicious tool descriptions embedded in an MCP server's manifest can inject hidden instructions that the LLM reads and obeys without the user ever seeing them. Unlike prompt injection through user input, this attack is embedded in the protocol layer itself.&lt;/p&gt;

&lt;p&gt;Supply chain risk is real and has already caused incidents. A malicious package impersonating a legitimate email service was uploaded to an MCP registry, quietly exfiltrating API keys from developers who installed it.&lt;/p&gt;

&lt;p&gt;According to Gartner's 2026 security predictions, 25% of enterprise GenAI applications will experience at least five minor security incidents per year by 2028. MCP's current security posture (no built-in auth, inconsistent registry vetting) will be a direct contributing factor if teams treat it like a mature standard when it is not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window Bloat
&lt;/h3&gt;

&lt;p&gt;Connecting ten MCP servers with five tools each burns thousands of tokens before the agent does anything useful. Every tool schema loads into the context window at session initialization. On a 128K context window, that overhead is a real tax on both cost and latency that compounds with every additional server you connect.&lt;/p&gt;
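&lt;p&gt;You can ballpark that tax before you see the bill. A rough sketch, using the common 4-characters-per-token rule of thumb (not a real tokenizer; use your model's tokenizer for actual budgeting), with schema sizes that are purely illustrative:&lt;/p&gt;

```python
import json

def schema_token_estimate(tool_schemas, chars_per_token: float = 4.0) -> int:
    """Rough token cost of loading tool schemas at session init.

    Serializes each schema and divides total characters by an assumed
    chars-per-token ratio.
    """
    total_chars = sum(len(json.dumps(s)) for s in tool_schemas)
    return int(total_chars / chars_per_token)

# 10 servers x 5 tools = 50 schemas, each ~400 characters (illustrative).
schemas = [{"name": f"tool_{i}", "description": "x" * 360} for i in range(50)]
print(schema_token_estimate(schemas))  # thousands of tokens before any work
```

That cost is paid at every session initialization, which is why it compounds at production query volumes.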

&lt;h3&gt;
  
  
  Debugging Is Harder Than With REST
&lt;/h3&gt;

&lt;p&gt;When something goes wrong with a REST API call, you reproduce it with a curl command. Copy the request, run it, and inspect the response. Deterministic and fast.&lt;/p&gt;

&lt;p&gt;With MCP, a failure involves reading JSON-RPC transport logs, verifying the server process is still running, checking whether session state was corrupted, and determining whether the tool schema was cached incorrectly. &lt;/p&gt;

&lt;p&gt;Stateful sessions mean failures are harder to reproduce in isolation. There is no mature equivalent of Postman for MCP debugging yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Ecosystem Is Still Maturing
&lt;/h3&gt;

&lt;p&gt;Many public MCP servers still accept unauthenticated calls. Auth implementation quality varies dramatically. Registry vetting is inconsistent. The developer community is aware and moving to address it, but that is not the same as solved. &lt;/p&gt;

&lt;p&gt;If you are connecting third-party MCP servers to production systems today, you are accepting risk that has no industry-standard mitigation yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic's MCP Strategy Actually Signals
&lt;/h2&gt;

&lt;p&gt;The decision to donate MCP to the Linux Foundation under the Agentic AI Infrastructure Foundation, co-founded with Block and OpenAI with participation from Google, Microsoft, AWS, and Cloudflare, was the move that matters most.&lt;/p&gt;

&lt;p&gt;This was not a marketing decision. Donating governance to a neutral foundation removes the single-vendor risk that would otherwise limit enterprise adoption. &lt;/p&gt;

&lt;p&gt;It is the same structural move that made Kubernetes safe to bet on. MCP now sits alongside Kubernetes and PyTorch in the Linux Foundation portfolio, a signal that carries real weight in enterprise architecture decisions.&lt;/p&gt;

&lt;p&gt;Anthropic's Claude Code integrates MCP natively, with explicit approval required for tool calls. That human-in-the-loop default is the right security posture. It reflects what a mature MCP implementation should look like, not the "connect everything and let the agent run" approach that characterizes most early-stage deployments.&lt;/p&gt;

&lt;p&gt;MCP becomes invisible plumbing. The same way you do not think about TCP when you make a network request, the goal is that you will not think about MCP when your agent calls a tool. That abstraction layer is years away from being seamless. But the architectural bet is sound.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You SHOULD Build on MCP in 2026
&lt;/h2&gt;

&lt;p&gt;Build on MCP when these conditions are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are building multi-tool agentic workflows where the agent needs to discover and chain tools dynamically rather than execute a fixed sequence&lt;/li&gt;
&lt;li&gt;Your team manages more than three to five external integrations for AI and the custom connector maintenance cost is already visible&lt;/li&gt;
&lt;li&gt;You are building developer tooling that needs to plug into Claude Code, Cursor, or other MCP-native AI environments, where your users will expect it&lt;/li&gt;
&lt;li&gt;You need centralized audit trails and governance across agent actions, which MCP's session model enables more cleanly than ad-hoc API wiring&lt;/li&gt;
&lt;li&gt;You are building for enterprise deployments where the same agent tools need to work across multiple AI platforms without rebuilding connectors for each&lt;/li&gt;
&lt;li&gt;You want to stop rebuilding integrations every time a new model provider is adopted by your customers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When You SHOULD NOT Build on MCP
&lt;/h2&gt;

&lt;p&gt;Stop before adding MCP complexity when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have one or two integrations and a direct API call is cleaner; do not over-architect a simple problem&lt;/li&gt;
&lt;li&gt;Your use case is deterministic and does not require dynamic tool discovery at runtime&lt;/li&gt;
&lt;li&gt;Your security posture requires proven, audited auth patterns; REST with OAuth 2.0 is significantly more mature right now&lt;/li&gt;
&lt;li&gt;You are in early prototyping and the overhead of running MCP servers and managing sessions slows down your ability to validate the idea first&lt;/li&gt;
&lt;li&gt;Your team has no prior experience with JSON-RPC or the MCP spec and the learning curve is not justified by the complexity of the use case yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What a Production-Ready MCP Implementation Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Most tutorials show you the happy path. Here is what the production path requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  Server Design: Keep the Surface Area Small
&lt;/h3&gt;

&lt;p&gt;Expose only the tools the agent actually needs for the task. Every tool you expose is part of your attack surface. &lt;/p&gt;

&lt;p&gt;Tool descriptions are read by the LLM; treat them as security-sensitive inputs and review them with the same scrutiny you apply to code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication and Authorization: Do Not Trust the Defaults
&lt;/h3&gt;

&lt;p&gt;The MCP spec does not enforce auth; you are responsible for it. Use OAuth with least-privilege scoping per tool, not per server. Rotate credentials.&lt;/p&gt;

&lt;p&gt;Avoid static API keys in config files. If a key leaks, it should grant access to one tool, not your entire integration layer.&lt;/p&gt;
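&lt;p&gt;Per-tool scoping can be sketched in a few lines. The scope names and tools here are hypothetical; the point is the containment property, not the specific policy:&lt;/p&gt;

```python
# Hypothetical least-privilege scope map: each tool requires its own
# scope rather than one server-wide credential.
TOOL_SCOPES = {
    "list_open_prs": "repo:read",
    "close_pr": "repo:write",
}

def authorize(tool_name: str, granted_scopes: set) -> bool:
    """Allow a tool call only if the session holds that tool's scope."""
    required = TOOL_SCOPES.get(tool_name)
    return required is not None and required in granted_scopes

# A leaked read-only credential cannot reach the write-capable tool.
print(authorize("list_open_prs", {"repo:read"}))  # allowed
print(authorize("close_pr", {"repo:read"}))       # denied
```

If a read-scoped key leaks, the blast radius is one read-only tool, not your entire integration layer.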

&lt;h3&gt;
  
  
  Observability: Your Forensic Trail
&lt;/h3&gt;

&lt;p&gt;Log every tool call with full context: tool name, parameters, agent identity, session ID, timestamp. Without this, diagnosing a bad agent action after the fact is close to impossible.&lt;/p&gt;
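&lt;p&gt;A minimal sketch of that audit wrapper. The field names are illustrative, not an MCP-mandated schema, and the wrapped tool is a stand-in lambda:&lt;/p&gt;

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mcp.audit")

def audited_call(tool_fn, tool_name, params, agent_id, session_id):
    """Emit a structured audit record, then execute the tool call."""
    record = {
        "tool": tool_name,
        "params": params,
        "agent": agent_id,
        "session": session_id,
        "ts": time.time(),
    }
    log.info(json.dumps(record))  # ship to your log aggregator in production
    return tool_fn(**params)

result = audited_call(
    lambda repo: f"3 open PRs in {repo}",
    tool_name="list_open_prs",
    params={"repo": "acme/api"},
    agent_id="triage-bot",
    session_id=str(uuid.uuid4()),
)
```

Logging before execution matters: if the tool call crashes the process, the record of the attempt still exists.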

&lt;p&gt;This is not optional infrastructure; it is the difference between a system you can debug and one where you are flying blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supply Chain Hygiene
&lt;/h3&gt;

&lt;p&gt;Vet every third-party MCP server before connecting it to production. Review source code. Pin versions. Do not auto-update. One malicious package in your agent's tool chain is enough to exfiltrate credentials or compromise infrastructure.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.mckinsey.com/capabilities/risk-and-resilience/our-insights/cybersecurity/cybersecurity-trends-looking-over-the-horizon" rel="noopener noreferrer"&gt;cybersecurity research&lt;/a&gt;, supply chain risks remain one of the most underestimated attack vectors for organizations adopting new integration standards quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop Controls
&lt;/h3&gt;

&lt;p&gt;High-stakes tool calls (anything that deletes, writes, deploys, or sends) should require explicit human confirmation before execution. Claude Code requires this by default. Your custom agent should too.&lt;/p&gt;
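&lt;p&gt;A sketch of that gate. Matching on verbs in the tool name is a simplistic, hypothetical policy; in practice, derive the high-stakes list from each tool's actual side effects:&lt;/p&gt;

```python
HIGH_STAKES = {"delete", "write", "deploy", "send"}

def requires_confirmation(tool_name: str) -> bool:
    """Flag destructive-sounding tools for explicit human approval."""
    return any(verb in tool_name.lower() for verb in HIGH_STAKES)

def execute(tool_name, tool_fn, approved: bool = False):
    """Run a tool call only if it is low-stakes or explicitly approved."""
    if requires_confirmation(tool_name) and not approved:
        return "BLOCKED: awaiting human approval"
    return tool_fn()

print(execute("list_open_prs", lambda: "ok"))       # read-only, runs freely
print(execute("delete_branch", lambda: "deleted"))  # blocked by default
```

The important design choice is the default: destructive calls are blocked unless approval is passed in, never the reverse.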

&lt;h2&gt;
  
  
  What It Actually Costs to Adopt MCP
&lt;/h2&gt;

&lt;p&gt;Teams consistently underestimate this. Here is a realistic cost map:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cost Area&lt;/th&gt;
&lt;th&gt;What Drives It&lt;/th&gt;
&lt;th&gt;Key Insight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context token overhead&lt;/td&gt;
&lt;td&gt;Tool schemas loaded at session init&lt;/td&gt;
&lt;td&gt;10 servers × 5 tools = thousands of tokens before work begins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server maintenance&lt;/td&gt;
&lt;td&gt;Version pinning, security patches, registry monitoring&lt;/td&gt;
&lt;td&gt;Ongoing ownership, not a one-time setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security implementation&lt;/td&gt;
&lt;td&gt;Auth, audit logging, supply chain vetting&lt;/td&gt;
&lt;td&gt;20-30% budget increase if retrofitted after initial build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging infrastructure&lt;/td&gt;
&lt;td&gt;Observability tooling, log aggregation, session tracing&lt;/td&gt;
&lt;td&gt;No mature off-the-shelf MCP debugger exists yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering ramp-up&lt;/td&gt;
&lt;td&gt;JSON-RPC familiarity, MCP spec learning curve&lt;/td&gt;
&lt;td&gt;1-2 weeks for an experienced backend engineer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The token overhead is the cost most teams do not model until they see their first production bill. A session loading ten MCP servers before answering a single user query is paying a fixed per-query tax in context tokens. At production query volumes, that accumulates fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: MCP Becomes Invisible Infrastructure
&lt;/h2&gt;

&lt;p&gt;Here is where this is heading: developers stop asking whether to adopt MCP and start asking how their integration layer is configured.&lt;/p&gt;

&lt;p&gt;The same way you do not debate TCP when you build a web service, you will not debate MCP when you build an agent. The protocol becomes the assumed substrate.&lt;/p&gt;

&lt;p&gt;Google's A2A (Agent-to-Agent) protocol handles agent-to-agent communication. MCP handles agent-to-tool communication. These two standards together define the connective tissue of production agentic AI: A2A for agents coordinating with each other, MCP for agents interacting with the world. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025" rel="noopener noreferrer"&gt;AI agent researchs&lt;/a&gt; reinforces this: by 2029, at least half of knowledge workers are expected to be creating, governing, and deploying agents on demand. The integration layer that connects those agents to tools is MCP.&lt;/p&gt;

&lt;p&gt;The teams building security discipline and observability into MCP deployments today are not just solving a current problem. They are building the operational foundation that will matter for the next five years of agentic AI.&lt;/p&gt;

&lt;p&gt;The abstraction will rise. MCP will become invisible. The engineering judgment required to implement it correctly will not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Model Context Protocol has already won the standards race. 97M monthly SDK downloads, Linux Foundation governance, every major AI vendor committed. The adoption question is closed.&lt;/li&gt;
&lt;li&gt;MCP does not replace REST APIs. It is an orchestration layer above them. Your existing APIs still do the actual work.&lt;/li&gt;
&lt;li&gt;The N×M problem is real. MCP solves it by reducing N×M custom connectors to N+M single-protocol implementations. The integration time savings are measurable and compound as your agent toolset grows.&lt;/li&gt;
&lt;li&gt;Security is the part most tutorials skip and the part that will cost you most. 82% of surveyed implementations have path traversal vulnerabilities. The spec has no built-in auth. You are responsible for all of it.&lt;/li&gt;
&lt;li&gt;Do not add MCP complexity for use cases that one or two direct API calls can handle. The token overhead, server maintenance, and debugging complexity are not free.&lt;/li&gt;
&lt;li&gt;A production-ready MCP deployment requires: minimal tool surface area, proper OAuth, trace-level observability, supply chain vetting, and human-in-the-loop controls on high-stakes actions. Budget for all of it from day one; retrofitting security costs 20-30% more than building it in.&lt;/li&gt;
&lt;li&gt;Gartner projects 40% of enterprise apps will embed AI agents by end of 2026. Every one of those agents will need a tool layer. MCP is it.&lt;/li&gt;
&lt;li&gt;Build on MCP when dynamic tool discovery, multi-platform interoperability, or integration scale justifies the complexity. Use direct APIs or function calling when it does not.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Building production MCP systems and have war stories about what actually broke? Drop them in the comments. The useful knowledge is always in the failures nobody writes case studies about.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Should You Be Using RAG in 2026?</title>
      <dc:creator>Riddhesh</dc:creator>
      <pubDate>Wed, 15 Apr 2026 09:12:28 +0000</pubDate>
      <link>https://dev.to/riddhesh/should-you-be-using-rag-in-2026-28ef</link>
      <guid>https://dev.to/riddhesh/should-you-be-using-rag-in-2026-28ef</guid>
      <description>&lt;p&gt;Every few months, someone drops a hot take that RAG is dead. Then the next enterprise deal closes, and everyone goes back to building pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current Narrative Around RAG
&lt;/h2&gt;

&lt;p&gt;Two positions dominate the conversation right now.&lt;br&gt;
The first: RAG is dead. Long context windows from Anthropic and Google have made retrieval unnecessary. Pass everything into the prompt and call it done.&lt;/p&gt;

&lt;p&gt;The second: RAG is the backbone of enterprise AI and nothing has changed. Stick to the patterns, ship the pipeline.&lt;br&gt;
Both are wrong, and both are reacting to surface-level signals rather than production reality.&lt;/p&gt;

&lt;p&gt;Teams calling RAG dead are confusing context window size with retrieval discipline. Yes, Claude Opus 4.6 now has a 1M-token context window at standard pricing with no long-context surcharge.&lt;/p&gt;

&lt;p&gt;That is a meaningful shift. But a larger context window does not eliminate the need for choosing what goes into it.&lt;/p&gt;

&lt;p&gt;The other camp ignores the real complexity and cost that most teams underestimate until they are already deep in production.&lt;/p&gt;

&lt;p&gt;RAG is not dead. But the version most teams are building in 2026 is not the version that was production-ready two years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RAG Actually Solves
&lt;/h2&gt;

&lt;p&gt;LLMs hallucinate. That is not a bug being patched; it is a fundamental property of how autoregressive generation works. Models predict plausible next tokens, not verified facts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/" rel="noopener noreferrer"&gt;Research&lt;/a&gt; shows that hallucinations remain prevalent in complex reasoning and open-domain factual recall, where error rates can exceed 33%. In a customer-facing application, that is not a product quirk. That is a liability.&lt;/p&gt;

&lt;p&gt;Production-grade RAG, when combined with guardrails and evaluation, reduces hallucinations by 40–96% depending on the stack and the use case, as shown in recent &lt;a href="https://hai.stanford.edu/ai-index" rel="noopener noreferrer"&gt;AI evaluation studies&lt;/a&gt; and benchmarks.&lt;/p&gt;

&lt;p&gt;We have seen this range in our own builds. A well-tuned pipeline with hybrid retrieval and a reranking layer lands closer to the high end of that range. A naive one-shot vector search lands near the low end.&lt;/p&gt;

&lt;p&gt;RAG does not eliminate hallucinations. It constrains the model to a verified knowledge boundary. That distinction matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Static Knowledge Limitations
&lt;/h3&gt;

&lt;p&gt;Every LLM has a training cutoff. The knowledge is frozen. For most enterprise use cases (policies, product documentation, internal SOPs, and regulatory guidelines), the information that matters is the information that changed last Tuesday, not last year.&lt;/p&gt;

&lt;p&gt;RAG solves this by treating retrieval as a real-time layer. The model never needs to be retrained. Your knowledge base gets updated. The system stays current.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accessing Private and Enterprise Data
&lt;/h3&gt;

&lt;p&gt;This is the most underrated reason enterprises build RAG in 2026. It is not about hallucinations or knowledge cutoffs alone. It is about the fact that the most valuable data a company has is not in any public LLM's training set, and it never will be.&lt;/p&gt;

&lt;p&gt;Customer history, internal research, product specs, compliance documentation. None of that will be absorbed into a foundation model. RAG is how you connect a capable LLM to your actual organizational knowledge without retraining anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Market Reality
&lt;/h2&gt;

&lt;p&gt;Here is a number that should end the "RAG is dead" debate: a technology growing at nearly a 50% CAGR is not dead technology.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.grandviewresearch.com/industry-analysis/retrieval-augmented-generation-rag-market-report" rel="noopener noreferrer"&gt;Grand View Research&lt;/a&gt;, the global RAG market was valued at USD 1.2 billion in 2024 and is projected to reach USD 11 billion by 2030, growing at a CAGR of 49.1%. &lt;a href="https://www.precedenceresearch.com/retrieval-augmented-generation-market" rel="noopener noreferrer"&gt;Precedence Research&lt;/a&gt; puts the 2026 market at USD 2.76 billion, expanding toward USD 67.42 billion by 2034 at a CAGR of 49.12%.&lt;/p&gt;

&lt;p&gt;Dead tech does not grow this fast.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Figure&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2024 RAG Market Size&lt;/td&gt;
&lt;td&gt;$1.2 billion&lt;/td&gt;
&lt;td&gt;Grand View Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026 RAG Market Size&lt;/td&gt;
&lt;td&gt;~$2.76 billion&lt;/td&gt;
&lt;td&gt;Precedence Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2030 Projected Size&lt;/td&gt;
&lt;td&gt;$11 billion&lt;/td&gt;
&lt;td&gt;Grand View Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAGR (2025–2030)&lt;/td&gt;
&lt;td&gt;~49%&lt;/td&gt;
&lt;td&gt;Grand View Research / Precedence Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;North America Market Share (2024)&lt;/td&gt;
&lt;td&gt;~37%&lt;/td&gt;
&lt;td&gt;Grand View Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Asia Pacific Growth Rate&lt;/td&gt;
&lt;td&gt;Fastest-growing region&lt;/td&gt;
&lt;td&gt;Multiple sources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The demand signal is real. Enterprises in healthcare, legal, finance, and customer service are deploying RAG because the alternatives (fine-tuning, retraining, or relying on generic model knowledge) do not meet compliance, accuracy, or freshness requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  How RAG Has Evolved (2024 to 2026)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeh0egc2xf7wj6rpl340.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeh0egc2xf7wj6rpl340.png" alt="Evolution of RAG systems from basic retrieval to multi-step pipelines showing improvements in data filtering and retrieval quality" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG v1 vs v2 vs v3
&lt;/h3&gt;

&lt;p&gt;What most engineers built in 2023–2024 was RAG v1. It looked impressive in demos. It fell apart in production. Here is how the architecture has matured:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Core Pattern&lt;/th&gt;
&lt;th&gt;Retrieval Method&lt;/th&gt;
&lt;th&gt;Primary Weakness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG v1 (2023)&lt;/td&gt;
&lt;td&gt;Single vector store + LLM&lt;/td&gt;
&lt;td&gt;Dense embedding search&lt;/td&gt;
&lt;td&gt;Keyword misses, poor ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG v2 (2024)&lt;/td&gt;
&lt;td&gt;Hybrid search + reranking&lt;/td&gt;
&lt;td&gt;Dense + sparse (BM25)&lt;/td&gt;
&lt;td&gt;Chunking quality, latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG v3 (2025–2026)&lt;/td&gt;
&lt;td&gt;Agentic pipelines + GraphRAG&lt;/td&gt;
&lt;td&gt;Multi-stage + knowledge graphs&lt;/td&gt;
&lt;td&gt;Complexity, evaluation gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Improvements in Modern RAG
&lt;/h3&gt;

&lt;p&gt;Hybrid search is now standard. Dense vector search alone misses exact-match queries. Sparse retrieval (BM25) misses semantic intent. You need both, combined with a fusion layer. Teams that skip this step are leaving retrieval quality on the table.&lt;/p&gt;
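&lt;p&gt;A minimal sketch of that fusion layer, using Reciprocal Rank Fusion (RRF), a common way to merge dense and sparse result lists without tuning their score scales against each other. The doc IDs are illustrative, and k=60 is the conventional constant:&lt;/p&gt;

```python
# Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
# so documents ranked highly by multiple retrievers accumulate the most score.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # ranked output of vector search
sparse = ["doc_b", "doc_d", "doc_a"]  # ranked output of BM25
fused = rrf_fuse([dense, sparse])     # doc_b and doc_a fuse to the top
```

&lt;p&gt;Documents that rank well in both lists accumulate the most score, which is exactly the agreement signal fusion is meant to surface.&lt;/p&gt;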

&lt;p&gt;Reranking is the layer most teams skip and then regret. A cross-encoder reranker takes the top-k retrieved chunks and reorders them by actual relevance to the query, not just vector similarity. This single step often produces more improvement than spending weeks on embedding model selection.&lt;/p&gt;
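&lt;p&gt;A rerank stage is mostly plumbing around one scoring call. The sketch below shows its shape; &lt;code&gt;score_pair&lt;/code&gt; is a toy lexical-overlap stand-in for a real cross-encoder model, and the chunks are made up:&lt;/p&gt;

```python
def score_pair(query, chunk):
    # toy overlap scorer standing in for a cross-encoder relevance model
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q.intersection(c)) / max(len(q), 1)

def rerank(query, chunks, keep=5):
    # retrieve wide, reorder by query relevance, pass only the best forward
    ordered = sorted(chunks, key=lambda ch: score_pair(query, ch), reverse=True)
    return ordered[:keep]

chunks = [
    "office holiday schedule",
    "refund policy for annual plans",
    "security incident runbook",
]
top = rerank("annual plan refund policy", chunks, keep=2)
```

&lt;p&gt;Swapping the toy scorer for a real cross-encoder changes nothing about this structure, which is why the stage is cheap to add and hard to justify skipping.&lt;/p&gt;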

&lt;p&gt;Multi-stage pipelines are replacing single-shot retrieval. Query decomposition, sub-query routing, context filtering, and synthesis are now separate stages with their own quality signals. More complex to build. Significantly more reliable in production.&lt;/p&gt;
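&lt;p&gt;To make the stages concrete, here is a toy decomposition-and-routing sketch. The decomposer and router here are deliberate simplifications; in production these would be model calls, each with its own logging and quality signal:&lt;/p&gt;

```python
def decompose(query):
    # naive split stands in for an LLM-based query decomposition step
    return [part.strip() for part in query.split(" and ")]

def route(sub_query):
    # toy keyword router; a production router might be a small classifier
    return "billing_index" if "refund" in sub_query else "general_index"

def plan_retrieval(query):
    # each stage is a separate, inspectable step rather than one opaque call
    return [{"sub_query": sub, "index": route(sub)} for sub in decompose(query)]

plan = plan_retrieval("explain refund timelines and list supported regions")
```

&lt;p&gt;The value is not in any single stage but in the seams between them: each boundary is a place to log, evaluate, and cache.&lt;/p&gt;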

&lt;h2&gt;
  
  
  The Real Problems With RAG
&lt;/h2&gt;

&lt;p&gt;RAG works great in demos. It breaks in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;A naive RAG pipeline adds 800ms to 2 seconds of latency per query. Embedding the user query, hitting the vector store, reranking retrieved chunks, and then generating a response: each step has a cost. Users expect sub-500ms responses. You have to engineer aggressively to hit that.&lt;/p&gt;

&lt;p&gt;Caching, pre-fetched indexes, streaming responses, and careful pipeline profiling are not optional. They are the difference between a product and a prototype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking Complexity
&lt;/h3&gt;

&lt;p&gt;Chunking is where most RAG systems silently fail. Fixed-size chunking (splitting documents every 512 tokens) breaks semantic units. A sentence that starts in one chunk and ends in the next becomes unretrievable. An answer that requires context from two adjacent paragraphs gets severed.&lt;/p&gt;

&lt;p&gt;Semantic chunking, section-aware splitting, and metadata-rich indexing are now table stakes, but they add meaningful engineering overhead that nobody budgets for at the start of the project.&lt;/p&gt;
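&lt;p&gt;A sentence-aware chunker is one step up from fixed-size splitting: pack whole sentences until a token budget is hit, so no boundary lands mid-sentence. This sketch uses whitespace tokens and period splitting as simplifications:&lt;/p&gt;

```python
def chunk_by_sentence(text, max_tokens=50):
    # pack complete sentences up to a budget instead of cutting every N tokens
    chunks, current, count = [], [], 0
    for sentence in text.split(". "):
        n = len(sentence.split())
        if count + n > max_tokens and current:
            chunks.append(". ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(". ".join(current))
    return chunks
```

&lt;p&gt;Section-aware splitting extends the same idea: respect headings and paragraphs first, then fall back to sentence boundaries.&lt;/p&gt;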

&lt;h3&gt;
  
  
  Retrieval Failures
&lt;/h3&gt;

&lt;p&gt;The model can only work with what you retrieve. If the retrieval step returns three irrelevant chunks, the LLM either hallucinates to fill the gap or produces a useless non-answer. &lt;/p&gt;

&lt;p&gt;Retrieval quality is the ceiling on RAG quality. Most evaluation frameworks miss this because they test generation quality, not retrieval precision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Challenges
&lt;/h3&gt;

&lt;p&gt;RAG evaluation is genuinely hard. You need to measure retrieval recall, context relevance, answer faithfulness, and answer correctness: four separate signals, each requiring its own methodology. Teams that skip structured evaluation ship systems they cannot diagnose when they degrade.&lt;/p&gt;
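&lt;p&gt;Each signal needs its own measurement. As a minimal example, retrieval recall is simple to define once you have gold relevance labels (the chunk IDs here are illustrative):&lt;/p&gt;

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    # of the chunks the gold answer needs, how many did retrieval return?
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    hit = len(relevant.intersection(retrieved_ids))
    return hit / len(relevant)

retrieval_recall(["c1", "c4", "c7"], ["c1", "c2"])  # 0.5: c2 was never retrieved
```

&lt;p&gt;A generation-side metric can score 100% while this number sits at 50%, which is exactly the failure mode the previous section describes.&lt;/p&gt;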

&lt;h2&gt;
  
  
  RAG vs Fine-Tuning vs Agents
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39k11akxnadbgaoswyt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39k11akxnadbgaoswyt1.png" alt="RAG vs Fine-Tuning vs Agents comparison diagram showing differences in architecture" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the question every technical founder should have a clear answer to before choosing an architecture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;No retraining, fresh knowledge, explainable, cost-effective&lt;/td&gt;
&lt;td&gt;Latency overhead, retrieval failures, chunking complexity&lt;/td&gt;
&lt;td&gt;Dynamic knowledge, private data, compliance-sensitive use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Tuning&lt;/td&gt;
&lt;td&gt;Faster inference, domain-specific tone/format, no retrieval&lt;/td&gt;
&lt;td&gt;Expensive, knowledge is static, hard to update, requires quality data&lt;/td&gt;
&lt;td&gt;Consistent style/format, specialized reasoning, narrow domains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents&lt;/td&gt;
&lt;td&gt;Dynamic, multi-step, tool-using, self-correcting&lt;/td&gt;
&lt;td&gt;High latency, unpredictable, expensive per query, hard to evaluate&lt;/td&gt;
&lt;td&gt;Complex workflows, autonomous research, multi-system orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RAG wins when your primary constraint is knowledge freshness and accuracy over private data. Fine-tuning wins when your primary constraint is style consistency, inference speed, or domain-specific formatting. Agents win when the task is inherently multi-step and requires dynamic decision-making.&lt;/p&gt;

&lt;p&gt;The mistake most teams make is defaulting to one architecture for everything. The right call is almost always a hybrid: RAG for grounding, light fine-tuning for tone consistency, and agentic orchestration for workflows that need planning.&lt;/p&gt;

&lt;p&gt;Why are agents rising so fast? Because modern LLMs are capable enough to execute multi-step plans reliably. RAG becomes one tool that an agent calls, rather than the whole system. This is the direction production architectures are moving.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Modern AI Companies Are Changing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Long Context Windows Are Real But Not a Replacement
&lt;/h3&gt;

&lt;p&gt;Anthropic made its 1M-token context window generally available for Claude Opus 4.6 and Sonnet 4.6, removing the long-context pricing premium entirely. A 900,000-token request now costs the same per token as a 9,000-token one.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.6 scored 78.3% on MRCR v2 at 1M tokens, the highest recall among frontier models compared to Gemini 3 Pro's 26.3% at the same context length, according to &lt;a href="https://www.anthropic.com/news" rel="noopener noreferrer"&gt;official reports&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This matters. It changes the cost math for some use cases. A team that previously used RAG to process a 200-page legal document can now consider stuffing the full document into a single prompt for a fraction of what it cost six months ago.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Context Does Not Equal Retrieval
&lt;/h3&gt;

&lt;p&gt;But here is what gets lost in the benchmarks: a 1M-token context window and a well-structured retrieval pipeline solve different problems.&lt;/p&gt;

&lt;p&gt;Context windows are static per-request. You load what you know needs to be there. RAG is dynamic: it retrieves what is relevant at query time across a corpus that may contain millions of documents.&lt;/p&gt;

&lt;p&gt;No context window is large enough to hold an enterprise knowledge base. And even if it were, you would not want to pay to process it in full on every single query.&lt;/p&gt;

&lt;p&gt;As Anthropic's own pricing analysis shows, flat-rate long-context pricing makes the economics more predictable and linear, but a single 1M-token prompt still costs $3 in input tokens at Sonnet rates, which adds up significantly at production query volumes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where RAG Still Fits
&lt;/h3&gt;

&lt;p&gt;RAG remains the right call for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge bases larger than any context window&lt;/li&gt;
&lt;li&gt;High-query-volume workloads where cost per query matters&lt;/li&gt;
&lt;li&gt;Use cases requiring source citations and explainability&lt;/li&gt;
&lt;li&gt;Real-time document ingestion where the corpus updates continuously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long context is a tool. RAG is an architecture. They are increasingly complementary, not competitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You SHOULD Use RAG
&lt;/h2&gt;

&lt;p&gt;Use RAG when all of the following are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data is private, proprietary, or constantly evolving: pretrained models can’t keep up with changing information, while retrieval keeps responses aligned with your latest data.&lt;/li&gt;
&lt;li&gt;Accuracy and citation matter: responses can be tied back to actual source documents, making them easier to trust, verify, and audit.&lt;/li&gt;
&lt;li&gt;The knowledge base exceeds practical context window limits: retrieval narrows things down so only the most relevant chunks are injected, which improves both efficiency and output quality.&lt;/li&gt;
&lt;li&gt;You operate in a regulated industry like healthcare, legal, or finance, where explainable outputs are critical: RAG provides a clear link between responses and approved knowledge sources.&lt;/li&gt;
&lt;li&gt;Answers need to come from specific internal documents rather than general model knowledge: retrieval acts as the bridge between your data and the model’s reasoning.&lt;/li&gt;
&lt;li&gt;Your information changes frequently: maintaining a vector database is far more practical than retraining or fine-tuning models every time something updates.&lt;/li&gt;
&lt;li&gt;One caveat: all of this only works if the system is implemented properly. Retrieval quality depends heavily on chunking, embeddings, and ranking, so a weak setup can actually hurt performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise data indicates that organizations are choosing RAG for 30–60% of their AI use cases, specifically where high accuracy, transparency, and reliable outputs over proprietary data are required.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You SHOULD NOT Use RAG
&lt;/h2&gt;

&lt;p&gt;Stop reaching for RAG when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task relies on general knowledge, like coding help, writing, or basic Q&amp;amp;A: adding retrieval usually creates unnecessary overhead without improving results.&lt;/li&gt;
&lt;li&gt;The dataset is small and static: if everything fits into a prompt, direct context injection is often cleaner and more reliable.&lt;/li&gt;
&lt;li&gt;Query volume is low: passing full documents per request can be simpler than maintaining a full retrieval pipeline.&lt;/li&gt;
&lt;li&gt;The application is real-time or latency-sensitive: the added retrieval and reranking steps introduce delays that may not be acceptable.&lt;/li&gt;
&lt;li&gt;The team has no experience with retrieval systems: poor chunking or irrelevant results can degrade output quality instead of improving it.&lt;/li&gt;
&lt;li&gt;You are still prototyping: validating the idea with basic prompting is faster, and retrieval can always be layered in once the need becomes clear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many teams would be better served by a well-crafted system prompt with a few embedded documents than by a full RAG pipeline. Start simple. Add retrieval when you actually need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Production-Ready RAG Stack Looks Like in 2026
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpd419qs8h7k0q31ilet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpd419qs8h7k0q31ilet.png" alt="Production RAG architecture diagram showing user query, retrieval, reranking, LLM processing, and response flow with supporting layers like caching and observability" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Layer
&lt;/h3&gt;

&lt;p&gt;Hybrid search combining dense vector search and BM25 sparse retrieval with a score fusion layer. &lt;a href="https://code-b.dev/blog/llm-embeddings" rel="noopener noreferrer"&gt;Embedding&lt;/a&gt; model selection matters: domain-specific embeddings consistently outperform general-purpose ones for specialized corpora, which is why understanding embeddings before you pick a model is worth the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reranking
&lt;/h3&gt;

&lt;p&gt;A cross-encoder reranker sits between retrieval and generation. It takes the top-20 retrieved chunks and reorders them by actual query relevance before passing the top-5 to the LLM. This is the highest-leverage improvement most teams are not making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Filtering
&lt;/h3&gt;

&lt;p&gt;Metadata filters, access controls, and document freshness scoring. Not every retrieved document should reach the model. &lt;/p&gt;

&lt;p&gt;Stale documents, low-confidence chunks, and access-restricted content need to be filtered before they can contaminate the response.&lt;/p&gt;
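&lt;p&gt;A filtering pass can be as simple as a few guard clauses. The field names (&lt;code&gt;acl&lt;/code&gt;, &lt;code&gt;updated_days_ago&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;) and thresholds are illustrative placeholders:&lt;/p&gt;

```python
def filter_context(chunks, user_groups, max_age_days=365, min_score=0.3):
    kept = []
    for ch in chunks:
        if not set(ch["acl"]).intersection(user_groups):
            continue  # access-restricted for this user
        if ch["updated_days_ago"] > max_age_days:
            continue  # stale document
        if min_score > ch["score"]:
            continue  # low-confidence retrieval
        kept.append(ch)
    return kept

chunks = [
    {"text": "current policy", "acl": ["eng"], "updated_days_ago": 10, "score": 0.9},
    {"text": "old draft", "acl": ["eng"], "updated_days_ago": 900, "score": 0.8},
    {"text": "hr handbook", "acl": ["hr"], "updated_days_ago": 5, "score": 0.9},
]
kept = filter_context(chunks, {"eng"})  # only "current policy" survives
```

&lt;p&gt;The important design choice is that filtering runs after retrieval but before the model ever sees the context.&lt;/p&gt;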

&lt;h3&gt;
  
  
  Caching
&lt;/h3&gt;

&lt;p&gt;Semantic caching for repeated or similar queries. Query-level caching reduces both latency and cost at scale. &lt;/p&gt;

&lt;p&gt;Anthropic's prompt caching makes long-context RAG workflows significantly cheaper for repeated queries against the same document corpus: the first read costs full price, and subsequent reads come at a fraction of the cost.&lt;/p&gt;
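&lt;p&gt;A semantic cache checks whether a new query is close enough to one already answered. In this sketch, &lt;code&gt;embed&lt;/code&gt; is a toy bag-of-words stand-in for a real embedding model, and the 0.8 threshold is a placeholder to tune:&lt;/p&gt;

```python
import math

def embed(text):
    # toy bag-of-words vector; a real system would call an embedding model
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []  # list of (embedding, cached answer)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer  # close enough: reuse without a model call
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))
```

&lt;p&gt;Unlike an exact-match cache, this also catches paraphrases of earlier questions, which is where most of the real savings come from.&lt;/p&gt;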

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;This is the piece that kills RAG systems in production. You need trace-level logging across every pipeline stage: what was retrieved, what was scored, what was passed to the model, and what was generated.&lt;/p&gt;

&lt;p&gt;Without this, debugging a bad answer is impossible. Tooling like LangSmith, Arize, and custom logging layers are no longer optional.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Costs to Run RAG
&lt;/h2&gt;

&lt;p&gt;Teams consistently underestimate this. Here is a realistic breakdown:&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure cost
&lt;/h3&gt;

&lt;p&gt;Vector database, embedding model inference, and reranking model inference. At modest scale (100K queries/month), this is $500-$2,000/month in infrastructure depending on your stack. At 1M+ queries/month, this becomes a line item in the P&amp;amp;L.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM cost
&lt;/h3&gt;

&lt;p&gt;At Sonnet 4.6 pricing ($3 input / $15 output per million tokens), a RAG response that passes 5 chunks of 400 tokens each costs roughly $0.006 per query in input tokens alone. At 100K queries/month, that is $600/month just in context passing before generation.&lt;/p&gt;
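&lt;p&gt;The arithmetic behind those numbers, using the article's figures:&lt;/p&gt;

```python
# the article's Sonnet input price: $3 per million input tokens
chunks_per_query = 5
tokens_per_chunk = 400
input_price_per_million = 3.00
queries_per_month = 100_000

context_tokens = chunks_per_query * tokens_per_chunk          # 2,000 tokens
cost_per_query = context_tokens / 1_000_000 * input_price_per_million
monthly_context_cost = cost_per_query * queries_per_month

print(round(cost_per_query, 4), round(monthly_context_cost))  # 0.006 600
```

&lt;p&gt;Note this covers context passing only; output tokens at $15 per million come on top of it.&lt;/p&gt;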

&lt;h3&gt;
  
  
  Latency tradeoff
&lt;/h3&gt;

&lt;p&gt;Production RAG systems average 1.2-2.5 seconds end-to-end. If you need sub-500ms, you are engineering against the grain of the architecture. Aggressive caching and streaming can help, but they add complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance complexity
&lt;/h3&gt;

&lt;p&gt;Someone has to own the chunking pipeline, the embedding pipeline, the index refresh schedule, and the retrieval evaluation framework. This is not a set-and-forget system. Budget accordingly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;What Drives Cost / Complexity&lt;/th&gt;
&lt;th&gt;Key Insight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Vector DB, embeddings, reranking&lt;/td&gt;
&lt;td&gt;Costs scale non-linearly as query volume increases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token Consumption&lt;/td&gt;
&lt;td&gt;Retrieved context size&lt;/td&gt;
&lt;td&gt;More chunks = better accuracy, but higher per-query cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;Output length + model choice&lt;/td&gt;
&lt;td&gt;Often overlooked, but can exceed retrieval costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Retrieval + reranking pipeline&lt;/td&gt;
&lt;td&gt;Speed trade-offs are architectural, not easily optimized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System Maintenance&lt;/td&gt;
&lt;td&gt;Pipelines, indexing, evaluation&lt;/td&gt;
&lt;td&gt;Requires continuous ownership, not a one-time setup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RAG doesn’t just add cost; it shifts your system from a simple API call to a continuously managed infrastructure layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How RAG Is Becoming Invisible Infrastructure in AI Systems
&lt;/h2&gt;

&lt;p&gt;Here is where I think this is heading: RAG stops being a product decision and starts being infrastructure.&lt;/p&gt;

&lt;p&gt;In the same way that most teams do not think about how their application talks to a database (it just does), RAG will become an invisible layer in AI systems. The retrieval, chunking, and indexing will happen automatically as part of the model's operating environment.&lt;/p&gt;

&lt;p&gt;Anthropic's push toward MCP (Model Context Protocol) is an early signal of this. As of early 2026, MCP has been adopted by OpenAI, Google, and Microsoft, and donated to the Linux Foundation, becoming an industry standard for AI agent integration with external data sources.&lt;/p&gt;

&lt;p&gt;The question will not be "should we use RAG?" but "how is our context layer configured?" The abstraction will rise. The engineers building the abstraction will still need to understand everything in this post.&lt;/p&gt;

&lt;p&gt;RAG is not going away. It is going deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RAG is not dead. The market is growing at ~49% CAGR. Dead tech does not scale like that.&lt;/li&gt;
&lt;li&gt;Long context windows from Anthropic and Google change the cost math but do not eliminate the need for retrieval at enterprise scale.&lt;/li&gt;
&lt;li&gt;The version of RAG that fails in production is RAG v1. Hybrid search, reranking, and observability are now required, not optional.&lt;/li&gt;
&lt;li&gt;RAG reduces hallucinations by 40–96% in well-engineered stacks. But retrieval quality is the ceiling. You cannot generate your way out of a bad retrieval.&lt;/li&gt;
&lt;li&gt;The real competition is not RAG vs long context. It is RAG vs fine-tuning vs agents, and the right answer is almost always a hybrid depending on the use case.&lt;/li&gt;
&lt;li&gt;Do not build RAG because it is the trendy pattern. Build it when your use case genuinely requires dynamic retrieval over a large, private, frequently-updated knowledge corpus.&lt;/li&gt;
&lt;li&gt;Budget for the full stack: embedding pipelines, vector databases, reranking, caching, and, critically, observability. RAG is not a plug-and-play feature. It is a system.&lt;/li&gt;
&lt;li&gt;RAG is becoming infrastructure. The teams that build operational discipline around it now will be positioned well as the abstraction layer rises.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are building production AI systems and want to share what is actually working in your RAG stack, drop it in the comments. The interesting decisions are always in the engineering details no one writes blog posts about.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>CI/CD Pipeline for React Native</title>
      <dc:creator>Riddhesh</dc:creator>
      <pubDate>Fri, 15 Mar 2024 12:15:06 +0000</pubDate>
      <link>https://dev.to/riddhesh/cicd-pipeline-for-react-native-133f</link>
      <guid>https://dev.to/riddhesh/cicd-pipeline-for-react-native-133f</guid>
      <description>&lt;p&gt;&lt;a href="https://code-b.dev/blog/cicd-for-react-native" rel="noopener noreferrer"&gt;https://code-b.dev/blog/cicd-for-react-native&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>reactnative</category>
      <category>react</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Use hooks to transform your react native projects</title>
      <dc:creator>Riddhesh</dc:creator>
      <pubDate>Wed, 10 Jan 2024 07:46:21 +0000</pubDate>
      <link>https://dev.to/riddhesh/use-hooks-to-transform-your-react-native-projects-5cbl</link>
      <guid>https://dev.to/riddhesh/use-hooks-to-transform-your-react-native-projects-5cbl</guid>
      <description>&lt;p&gt;React Native, since its introduction, has transformed and improved mobile app development and performance through features like hooks.&lt;br&gt;
Hooks enable access to React state and lifecycle features in function components, overcoming the limitations of class components. They offer benefits such as enhanced code reusability, improved readability, smaller component sizes, easier state management, and improved testing. &lt;br&gt;
Key React Native hooks include &lt;code&gt;useState&lt;/code&gt; for managing component-specific states, &lt;code&gt;useEffect&lt;/code&gt; for handling side effects, &lt;code&gt;useContext&lt;/code&gt; for accessing context in functional components, and custom hooks for promoting code reusability by encapsulating common state logic. &lt;br&gt;
These hooks have revolutionized React Native development by simplifying and optimizing the coding process. &lt;br&gt;
Here's an example of how to use the &lt;code&gt;useState&lt;/code&gt; hook in a functional component:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import React, { useState } from 'react';
import { View, Text, Button } from 'react-native';

function Counter() {
  const [count, setCount] = useState(0);
  return (
    &amp;lt;View&amp;gt;
      &amp;lt;Text&amp;gt;{count}&amp;lt;/Text&amp;gt;
      &amp;lt;Button title="Increment" onPress={() =&amp;gt; setCount(count + 1)} /&amp;gt;
    &amp;lt;/View&amp;gt;
  );
}

export default Counter;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;P.S.&lt;br&gt;
Felt like doing an informative post as a first post, this is based on a more detailed blog post I made on my website, I don't know if I am allowed to share the link here but anyway, it's &lt;a href="https://code-b.dev/blog/react-native-hooks" rel="noopener noreferrer"&gt;this&lt;/a&gt;&lt;/p&gt;

</description>
      <category>react</category>
      <category>reactnative</category>
      <category>firstpost</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
