Why RAG Fails in Enterprise R&D (And What Actually Works)

Gilad Salinger — Tue, 19 May 2026 15:03:50 +0000

RAG was a breakthrough. Embedding documents into vectors, retrieving the most similar chunks at query time, and feeding them to an LLM gave models practical access to external knowledge beyond their training data. For a customer support bot searching a knowledge base, it's genuinely effective.

But when you deploy RAG into an enterprise R&D environment — with 1,000+ engineers, dozens of interconnected systems, and AI agents that need to take action, not just answer questions — it falls apart in predictable ways.

I'm Gilad Salinger, CEO of Naboo. We build the context layer that replaces RAG for enterprise AI agents. After deploying in production at companies like Global-E (NASDAQ: GLBE) and Melio, I want to share the five specific failure modes we kept seeing — and the architectural approach that fixes them.

The Setup: What Enterprise R&D Actually Looks Like

A typical enterprise engineering organization has context spread across:

Code repositories (GitHub, GitLab, Bitbucket) — often 50+ repos
Project management (Jira, Linear, Asana) — thousands of tickets
Documentation (Confluence, Notion) — much of it outdated
Communication (Slack, Teams) — where real decisions happen
Monitoring (Datadog, Splunk, PagerDuty) — production state
CI/CD (Jenkins, GitHub Actions) — deployment history

When a developer asks an AI coding agent "help me with this ticket," the agent needs to pull context from all of these systems, understand the relationships between them, and filter by what the developer is allowed to see.

RAG can't do this. Here's why.

Failure Mode 1: Context Fragmentation

RAG indexes each data source independently. You get a vector store for your code, another for your Confluence docs, another for Slack messages. But enterprise context is relational. A Jira ticket is meaningless without the code it references, the PR that implements it, and the Slack thread where the team discussed why they chose that approach.

When an agent retrieves the 10 most similar chunks from each of those three sources, it gets 30 disconnected fragments. The LLM has to guess how they relate. In our benchmarks, this guessing is where most accuracy loss occurs.

What works instead: Build a cross-system understanding that maps dependencies, ownership, and decision trails across all sources. When the agent queries for context, it gets a coherent package — not scattered chunks.
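
To make the idea of a "coherent package" concrete, here is a minimal sketch, not Naboo's implementation. It assumes artifacts from different systems can be linked as edges in a graph, and it assembles context by walking a bounded number of hops from the ticket. All artifact IDs, the edge list, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical artifact IDs; none of these come from a real system.
EDGES = [
    ("jira:PAY-101", "pr:482"),               # ticket implemented by a PR
    ("pr:482", "code:auth/session.py"),       # PR touches a file
    ("jira:PAY-101", "slack:thread-9"),       # ticket discussed in a thread
    ("jira:OPS-7", "code:infra/deploy.yml"),  # unrelated work
]

def build_graph(edges):
    """Undirected adjacency map linking artifacts across systems."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def context_package(graph, start, max_hops=2):
    """Breadth-first walk: every artifact within max_hops of start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in sorted(graph[node]):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return sorted(hops)

graph = build_graph(EDGES)
package = context_package(graph, "jira:PAY-101")
print(package)
```

The ticket, its PR, the file the PR touches, and the Slack thread come back together, while the unrelated ops ticket stays out. A production system would need typed edges and a way to discover them; this only shows the shape of the lookup.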

Failure Mode 2: No Intent Understanding

RAG retrieves based on text similarity. "Fix the authentication bug" and "review the authentication module" would retrieve nearly identical chunks. But these tasks need completely different context.

Fixing a bug requires: the specific error, recent changes to the auth flow, the PR that introduced the regression, relevant test failures. Reviewing a module requires: architectural overview, code ownership, tech debt history, related design documents.

RAG treats both the same because it only understands text distance, not task semantics.

What works instead: Calculate what context is needed based on the task type, current system state, and the user's role. Intent-aware retrieval, not similarity-based retrieval.
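
As a rough illustration, assume the task type can be classified up front and mapped to the kinds of context listed above. The keyword rules, intent labels, and context-type names below are all hypothetical stand-ins; a real system would use a much stronger classifier and would also factor in system state and the user's role.

```python
# Hypothetical mapping from task intent to the context it needs,
# following the bug-fix vs. review contrast described in the text.
CONTEXT_PLAN = {
    "fix_bug": ["error_details", "recent_changes", "regression_pr", "test_failures"],
    "review_module": ["architecture_overview", "code_ownership",
                      "tech_debt_history", "design_docs"],
}

# Toy keyword rules, used here only to keep the sketch self-contained.
INTENT_KEYWORDS = {
    "fix_bug": ("fix", "bug", "regression", "error"),
    "review_module": ("review", "audit", "assess"),
}

def classify_intent(query):
    """Return the intent with the most keyword hits, or None if nothing matches."""
    words = query.lower().split()
    scores = {
        intent: sum(keyword in words for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def plan_context(query):
    """Decide which kinds of context to fetch before any retrieval happens."""
    return CONTEXT_PLAN.get(classify_intent(query), [])

print(plan_context("Fix the authentication bug"))
print(plan_context("review the authentication module"))
```

The two queries share most of their words, yet they produce disjoint context plans, which is the behavior pure similarity search cannot provide.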

Failure Mode 3: Stale Context

Enterprise codebases change constantly. A PR merged 2 hours ago might change the correct approach to a task entirely. But most RAG systems re-index on a schedule — daily, sometimes weekly.

We had a case where a developer asked an AI agent for help refactoring a module. The agent suggested an approach based on the old architecture because the RAG index hadn't caught the PR that changed the module's interface the previous day. The developer spent 3 hours on a dead-end approach before realizing the context was stale.

What works instead: Continuous ingestion that updates the context model in real-time as commits, messages, tickets, and deployments happen. No batch re-indexing delays.
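
One way to picture continuous ingestion is an index that applies each change event as it arrives rather than waiting for a scheduled rebuild. The sketch below assumes events carry an artifact ID, a type, a monotonically increasing version, and the new content; that event shape and the `ContextIndex` class are illustrative assumptions, not a description of any particular product or webhook format.

```python
from dataclasses import dataclass, field

@dataclass
class ContextIndex:
    """In-memory stand-in for a context store, keyed by artifact ID."""
    docs: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)

    def apply_event(self, event):
        """Upsert or delete one artifact as soon as its event arrives."""
        key, version = event["id"], event["version"]
        # Ignore out-of-order events older than what is already indexed.
        if version <= self.versions.get(key, -1):
            return False
        if event["type"] == "deleted":
            self.docs.pop(key, None)
        else:
            self.docs[key] = event["content"]
        self.versions[key] = version
        return True

index = ContextIndex()
index.apply_event({"id": "code:billing/api.py", "type": "updated",
                   "version": 1, "content": "def charge(amount): ..."})
# A merged PR changes the interface; the index reflects it immediately.
index.apply_event({"id": "code:billing/api.py", "type": "updated",
                   "version": 2, "content": "def charge(amount, currency): ..."})
# A late, duplicate delivery of the old version is rejected.
stale_applied = index.apply_event({"id": "code:billing/api.py", "type": "updated",
                                   "version": 1, "content": "def charge(amount): ..."})
print(index.docs["code:billing/api.py"], stale_applied)
```

The version check matters because event streams are rarely delivered in order; without it, a delayed event could silently reintroduce the stale interface.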

Failure Mode 4: Security as an Afterthought

This one is critical for enterprise. Most RAG implementations index everything into a single vector store, then try to filter results after retrieval, an approach sometimes called "post-hoc RBAC."

The problem: similarity search on its own knows nothing about access controls. If a junior developer's query is semantically similar to a document they shouldn't see, the vector DB returns it, and a separate filtering layer has to catch it. Filtering layers have gaps.

In defense, financial services, and healthcare organizations — where a single data access violation can mean regulatory penalties — this architecture is a non-starter.

What works instead: Native RBAC that enforces permissions at retrieval time, not post-retrieval. The context layer inherits permissions from your existing tools (GitHub org roles, Jira project permissions, Confluence space restrictions) and only surfaces context the user is authorized to see.
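
The difference between enforcing permissions at retrieval time and filtering afterwards can be sketched in a few lines. This example assumes each document carries the set of groups allowed to read it and that the caller's groups are already known; the documents, group names, and two-dimensional embeddings are invented for illustration.

```python
import math

# Hypothetical documents: (id, embedding, groups allowed to read it).
DOCS = [
    ("runbook", [1.0, 0.0], {"engineering"}),
    ("salary-bands", [0.9, 0.1], {"hr"}),
    ("design-doc", [0.6, 0.8], {"engineering", "hr"}),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, user_groups, k=2):
    """Rank only documents the user may read; others never enter scoring."""
    allowed = [doc for doc in DOCS if doc[2] & user_groups]
    ranked = sorted(allowed, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [doc_id for doc_id, _, _ in ranked[:k]]

results = retrieve([1.0, 0.0], {"engineering"})
print(results)
```

Because the permission check narrows the candidate set before ranking, a restricted document cannot leak through a gap in a later filter, and it also cannot crowd permitted documents out of the top-k.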

Failure Mode 5: Token Waste

RAG retrieves by similarity, which means many returned chunks are tangentially relevant at best. In our analysis of production RAG deployments, roughly 60-70% of retrieved chunks contributed nothing to the agent's output. They just consumed tokens.

This matters for three reasons: cost (tokens aren't free at enterprise scale), latency (more tokens mean slower responses), and quality (noise in the context window degrades LLM output). The "lost in the middle" problem is well documented: LLMs pay less attention to information in the middle of long contexts.

What works instead: Deliver only the context the agent needs for the specific task. In our benchmarks, this means 90% fewer tokens with 97% higher accuracy. Less is more when the "less" is precisely targeted.
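
A simple way to think about "only the context the agent needs" is a relevance floor combined with a token budget. The sketch below is a greedy packer under those two assumptions; the chunk names, scores, token counts, and thresholds are made up for illustration and are not the figures from our benchmarks.

```python
def pack_context(chunks, budget, min_score=0.5):
    """Greedy pack: highest-scoring chunks first, skipping any that overflow."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # everything after this is below the relevance floor
        if used + chunk["tokens"] <= budget:
            selected.append(chunk["id"])
            used += chunk["tokens"]
    return selected, used

# Hypothetical retrieved chunks with relevance scores and token counts.
CHUNKS = [
    {"id": "error-trace", "score": 0.92, "tokens": 300},
    {"id": "regression-pr", "score": 0.88, "tokens": 450},
    {"id": "old-design-doc", "score": 0.41, "tokens": 1200},
    {"id": "failing-test", "score": 0.75, "tokens": 200},
    {"id": "team-offsite-notes", "score": 0.12, "tokens": 800},
]

selected, used = pack_context(CHUNKS, budget=1000)
total = sum(chunk["tokens"] for chunk in CHUNKS)
print(selected, used, total)
```

In this toy input the two low-relevance chunks account for most of the tokens, so dropping them shrinks the prompt sharply without removing anything the task depends on. Where the relevance scores come from is the hard part and is out of scope for the sketch.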

The Architecture That Works: Context Layers

The pattern we landed on — and that we're seeing the industry converge toward — is what we call a context layer. It sits between your data sources and your LLM/agent framework.

Instead of: Query → Embed → Vector search → Top-K chunks → LLM

It's: Query → Intent calculation → Cross-system context assembly → RBAC filtering → Execution-ready context → LLM

The key differences:

Cross-system understanding instead of per-source indexing
Intent-aware retrieval instead of similarity-based retrieval
Continuous ingestion instead of batch re-indexing
Native RBAC instead of post-hoc filtering
Precise context packages instead of top-K similar chunks
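
The flow above can be sketched as a chain of small stages, each taking and returning a request dictionary. This is a toy composition under stated assumptions: the stage names mirror the steps in the text, but the intent rule, the lookup table, the group names, and the prompt format are all hypothetical, and no LLM is actually called.

```python
def calculate_intent(request):
    """Toy intent rule; a real system would use a proper classifier."""
    is_fix = "fix" in request["query"].lower()
    request["intent"] = "fix_bug" if is_fix else "review"
    return request

def assemble_context(request):
    """Hypothetical cross-system lookup keyed by intent: (artifact, owning group)."""
    sources = {
        "fix_bug": [("pr:482", "engineering"), ("incident:17", "sre")],
        "review": [("design-doc", "engineering")],
    }
    request["candidates"] = sources[request["intent"]]
    return request

def filter_by_rbac(request):
    """Keep only artifacts whose group the user belongs to."""
    request["context"] = [
        item for item, group in request["candidates"]
        if group in request["user_groups"]
    ]
    return request

def render_prompt(request):
    """Produce the execution-ready text that would be handed to the LLM."""
    return f"Task: {request['query']}\nContext: {', '.join(request['context'])}"

PIPELINE = [calculate_intent, assemble_context, filter_by_rbac]

def run(query, user_groups):
    request = {"query": query, "user_groups": user_groups}
    for stage in PIPELINE:
        request = stage(request)
    return render_prompt(request)

prompt = run("Fix the authentication bug", {"engineering"})
print(prompt)
```

Keeping each stage as a plain function makes it easy to test stages in isolation and to swap one out, for example replacing the toy intent rule, without touching the rest of the chain.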

Benchmarks

We ran benchmarks using LLM-as-a-judge evaluation across production enterprise environments:

Metric	RAG (baseline)	Context Layer	Delta
Response accuracy	Baseline	+97%	Significant
Token consumption	Baseline	-90%	10x reduction
Response latency	Baseline	10x faster	Faster context assembly

The accuracy improvement comes primarily from the intent-aware retrieval and cross-system relationship mapping. The token reduction comes from delivering only relevant context instead of similarity-matched chunks.

When RAG Is Still Fine

To be clear: RAG isn't wrong, it's scoped. If you're building:

A customer support bot searching a knowledge base
A research assistant querying a document corpus
A chatbot for a small team with a single repo

RAG works. The failure modes above only manifest at enterprise scale — multiple systems, complex permissions, AI agents that need to execute (not just answer), and accuracy requirements where "approximately right" isn't good enough.

What to Try

If you're hitting these failure modes:

Audit your current RAG pipeline for the five failures above. Most enterprise teams are hitting at least 3.
Look at the context layer pattern — whether you build it yourself or use an existing implementation.
Start measuring accuracy with LLM-as-a-judge evaluation, not vibes. The gap between RAG and intent-aware context is only visible when you measure properly.
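
For the measurement step, a minimal evaluation harness might look like the sketch below. It is not our benchmark methodology. It assumes a set of questions with reference answers and a judge function that grades each answer; here the judge is a string-matching stub standing in for an LLM call, and the questions, answers, and pipeline names are invented.

```python
def evaluate(cases, answer_fn, judge_fn):
    """Return the fraction of cases the judge marks as correct."""
    verdicts = [
        judge_fn(case["question"], answer_fn(case["question"]), case["reference"])
        for case in cases
    ]
    return sum(verdicts) / len(verdicts)

def stub_judge(question, answer, reference):
    # Placeholder for an LLM call that grades the answer against a reference.
    return reference.lower() in answer.lower()

# Hypothetical evaluation cases and canned pipeline outputs.
CASES = [
    {"question": "Which PR changed the session timeout?", "reference": "PR 482"},
    {"question": "Who owns the billing module?", "reference": "payments team"},
]
ANSWERS_A = {
    "Which PR changed the session timeout?": "It was changed in PR 482.",
    "Who owns the billing module?": "I could not find an owner.",
}
ANSWERS_B = {
    "Which PR changed the session timeout?": "PR 482 changed it.",
    "Who owns the billing module?": "The payments team owns it.",
}

accuracy_a = evaluate(CASES, ANSWERS_A.get, stub_judge)
accuracy_b = evaluate(CASES, ANSWERS_B.get, stub_judge)
print(accuracy_a, accuracy_b)
```

The point of the structure is that both pipelines are scored on the same cases by the same judge, so the comparison is a number rather than an impression. Swapping `stub_judge` for a real model-backed grader changes nothing else in the harness.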

We've open-sourced our benchmark methodology. If you want to run it against your own data, reach out at naboo.ai or book a technical demo.

Gilad Salinger is CEO & Co-Founder of Naboo, the enterprise context layer for AI agents. Previously founded and scaled a developer tools company. Naboo is backed by Cardumen Capital and 91 Ventures, and is deployed in production at Global-E (NASDAQ: GLBE), Melio, and other enterprise R&D organizations.

DEV Community: Gilad Salinger
